# nonbacktracking_graph_neural_networks__52c14d7e.pdf Published in Transactions on Machine Learning Research (09/2024) Non-backtracking Graph Neural Networks Seonghyun Park1 , Narae Ryu2 , Gahee Kim2, Dongyeop Woo1, Se-Young Yun2 , Sungsoo Ahn1 {shpark26, dongyeop.woo, sungsoo.ahn}@postech.ac.kr, {nrryu, gaheekim, yunseyoung}@kaist.ac.kr 1POSTECH 2KAIST Reviewed on Open Review: https: // openreview. net/ forum? id= 64Hd QKny Tc The celebrated message-passing updates for graph neural networks allow representing largescale graphs with local and computationally tractable updates. However, the updates suffer from backtracking, i.e., a message flowing through the same edge twice and revisiting the previously visited node. Since the number of message flows increases exponentially with the number of updates, the redundancy in local updates prevents the graph neural network from accurately recognizing a particular message flow relevant for downstream tasks. In this work, we propose to resolve such a redundancy issue via the non-backtracking graph neural network (NBA-GNN) that updates a message without incorporating the message from the previously visited node. We theoretically investigate how NBA-GNN alleviates the over-squashing of GNNs, and establish a connection between NBA-GNN and the impressive performance of non-backtracking updates for stochastic block model recovery. Furthermore, we empirically verify the effectiveness of our NBA-GNN on the long-range graph benchmark and transductive node classification problems. 1 Introduction Non-backtracking Simple Step 1 Step 2 Figure 1: Comparison of two types of message flow. Step 1: Message flow from node i to node j. Step 2: The simple update includes the message flow from node j back to node i (left) at step 1, while the nonbacktracking update removes this redundant message flow (right). Recently, graph neural networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017; Xu et al., 2019) have shown great success in various applications, such as molecular property prediction (Gilmer et al., 2017) and community detection (Bruna & Li, 2017). Such success can be largely attributed to the message-passing structure of GNNs, which provides a computationally tractable way of incorporating the overall graph through iterative updates based on local neighborhoods. However, the message-passing structure also brings challenges due to the parallel updates and memoryless behavior of messages passed along the graph. In particular, the message flow in a GNN is prone to backtracking, where the message from vertex i to vertex j is reincorporated in the subsequent message from j to i, e.g. the left image of step 2 in Figure 1. Since the message-passing iteratively aggregates the information, the GNN inevitably encounters an exponential surge in the number of message flows, proportionate to the vertex degrees. This issue is compounded by backtracking, which accelerates the growth of message flows with redundant information. Equal Contribution, Co-corresponding author. Published in Transactions on Machine Learning Research (09/2024) The effectiveness of non-backtracking updates has been extensively explored in non-GNN message-passing algorithms or random walks (Fitzner & van der Hofstad, 2013; Rappaport et al., 2017) (Figure 1). For example, given a pair of vertices i, j, the belief propagation algorithm (Pearl, 1982) forbids an i j message from incorporating the j i message. Researchers have similarly considered non-Markovian walks (Alon et al., 2007), i.e., walks that do not consecutively traverse the same edge twice. Such classic algorithms have demonstrated great success in applications such as probabilistic graphical model inference and stochastic block models (Bordenave et al., 2015). In particular, the spectrum of the non-backtracking operator contains more useful information than that of the adjacency matrix in revealing the hidden structure of a graph model (Bordenave et al., 2015). However, despite their promise, non-backtracking updates have received limited attention in the context of GNNs. While Chen et al. (2018) have considered combining non-backtracking updates with the typical GNN updates, the updates are not designed to prevent backtracking updates, and the analysis is limited to the optimization landscape of linear GNNs. Hence, a thorough investigation of the benefits of non-backtracking updates in GNNs is warranted. Contribution. In this paper, we propose to use a non-backtracking graph neural network (NBA-GNN) to resolve the redundant messages, the over-squashing phenomenon. We employ non-backtracking updates on the messages, i.e., forbid a message from vertex i to vertex j from being incorporated in the message from vertex j to i. To this end, we associate hidden features with transitions between a pair of vertices, e.g., hj i, and update them from features associated with non-backtracking transitions, e.g., hk j for k = i. To motivate our work, we formulate message flows as the sensitivity of a GNN concerning walks in the graph. Then, we explain how the message flows are redundant; the GNN s sensitivity of a walk with backtracking transitions is covered by other non-backtracking walks. The redundancy harms the GNN since the number of walks increases exponentially as the number of layers grows and the GNN becomes insensitive to a particular walk information. Hence, reducing the redundancy by simply considering non-backtracking walks would benefit the message-passing updates to recognize each walk s information better. As a motivating example, we provide a connection between our sensitivity analysis to the over-squashing phenomenon for GNN (Topping et al., 2022; Black et al., 2023; Di Giovanni et al., 2023) in terms of access time. Furthermore, we analyze our NBA-GNNs from the perspective of over-squashing and their expressive capability to recover sparse stochastic block models (SBMs). To this end, we prove that NBA-GNN improves the Jacobian-based measure of over-squashing (Topping et al., 2022) compared to its original GNN counterpart. Next, we investigate NBA-GNN s proficiency in node classification within SBMs and its ability to distinguish between graphs originating from the Erdős Rényi model or the SBM, from the results of (Stephan & Massoulié, 2022; Bordenave et al., 2015). Unlike traditional GNNs that operate on adjacency matrices and necessitate an average degree of at least Ω(log n), NBA-GNN demonstrates the ability to perform node classification with a substantially lower average degree bound of ω(1) and no(1). Furthermore, the algorithm can accurately classify graphs even when the average degree remains constant. Finally, we empirically evaluate our NBA-GNN on the long-range graph benchmark (Dwivedi et al., 2022) and transductive node classification problems (Sen et al., 2008; Pei et al., 2019). We observe that our NBA-GNN demonstrates competitive performance and even achieves state-of-the-art performance on the long-range graph benchmark. For the node classification tasks, we demonstrate that NBA-GNN consistently improves over its conventional GNN counterpart. To summarize, our contributions are as follows: We propose NBA-GNN as a solution for the message flow redundancy problem in GNNs. We analyze how NBA-GNN alleviates over-squashing and is expressive enough to recover sparse stochastic block models with an average degree of o(log n). We empirically verify our NBA-GNN to show state-of-the-art performance on the long-range graph benchmark and consistently improve over the conventional GNNs for the transductive node classification tasks. Published in Transactions on Machine Learning Research (09/2024) 2 Related works Non-backtracking Algorithms. Many classical algorithms have considered non-backtracking updates (Newman, 2013; Kempton, 2016). Belief propagation (Pearl, 1982) infers the marginal distribution on probabilistic graphical models, and has demonstrated success for tree graphs (Kim & Pearl, 1983) and graphs with large girth (Murphy et al., 1999). Moreover, Mahé et al. (2004) and Aziz et al. (2013) suggested graph kernels between labeled graphs utilizing non-backtracking walks, and Krzakala et al. (2013) first used it for node classification. Furthermore, the non-backtracking has been shown to yield better spectral separation properties, and its eigenspace contains information about the hidden structure of a graph model (Bordenave et al., 2015; Stephan & Massoulié, 2022). Non-backtracking and GNNs. We also note that there have been similar approaches of applying nonbacktracking to GNNs. Chen et al. (2018) first used the non-backtracking operator in GNNs, though they do not prevent backtracking and only target community detection tasks. Also, Chen et al. (2022) have computed a non-redundant tree for every node to eliminate redundancy, but inevitably suffers from high complexity. We emphasize the distinction between prior works and our NBA-GNN in Appendix A from the lens of (i) computational complexity and (ii) empirical performance. Over-squashing of GNNs. When a node receives information from a k-hop neighbor node, an exponential number of messages pass through node representations with fixed-sized vectors. This leads to the loss of information known as over-squashing (Alon & Yahav, 2021), and has been formalized in terms of sensitivity (Topping et al., 2022; Di Giovanni et al., 2023). Hence, sensitivity is defined as the Jacobian of a final node feature at a GNN layer to the initial node representation and can be upper bounded via the graph topology. Stemming from this, graph rewiring methods alleviate over-squashing by adding or removing edges to compute an optimal graph (Topping et al., 2022; Black et al., 2023; Di Giovanni et al., 2023). Another line of work uses global aspects, e.g., Transformers have been applied to consider global aspects to avoid over-squashing (Ying et al., 2021; Kreuzer et al., 2021; Rampášek et al., 2022; Shirzad et al., 2023). Expressivity of GNNs for the SBM. Certain studies focus on analyzing the expressive power of GNNs using variations of the SBM (Holland et al., 1983). Fountoulakis et al. (2023) established conditions for the existence of graph attention networks (GATs) that can precisely classify nodes in the contextual stochastic block model (CSBM) with high probability. Similarly, Baranwal et al. (2023) investigated the effects of graph convolutions within a network on the XOR-CSBM. The preceding works primarily focused on the probability distribution of node features, such as the distance between the means of feature vectors. On the other hand, Kanatsoulis & Ribeiro (2023) analyzed the expressivity utilizing linear algebraic tools and eigenvalue decomposition of graph operators. 3 Non-backtracking Graph Neural Network 3.1 Motivation from Sensitivity Analysis We first explain how the conventional message-passing updates are prone to backtracking. To be specific, consider a simple, undirected graph G = (V, E) and let N(i) denote the set of neighbor nodes of the node i. Each node i V is associated with a feature xi. Then, the conventional graph neural networks (GNNs), i.e., message-passing neural networks (MPNNs) (Gilmer et al., 2017), iteratively update the node-wise hidden feature at the t-th layer h(t) i as follows: h(t+1) i = ϕ(t) h(t) i , n ψ(t) h(t) i , h(t) j : j N(i) o , (1) where ϕ(t) and ψ(t) are architecture-specific non-linear update and permutation invariant aggregation functions, respectively. Our key observation is that the message from a node feature h(t) i to the node feature h(t+1) j is reincorporated in the node feature h(t+2) i , e.g., Figure 3a shows the computation graph of conventional GNNs with redundant messages. Published in Transactions on Machine Learning Research (09/2024) Sensitivity Analysis. To concretely describe the consequences of backtracking in message-passing updates, we formulate the sensitivity of the final node feature h(T ) i with respect to the input as follows: h(T ) i h(0) j = X h(t) s(t) h(t 1) s(t 1) , (2) where h(0) i = xi, W(i) denotes the set of T-step walks ending at node i, and s(t) denotes the t-th node in the walk s W(i). Intuitively, this equation shows that a GNN with T layers recognize the graph via aggregation of random walks with length T. Our key observation from Equation (2) is on how the feature h(T ) i is insensitive to the information from an initial node feature h(0) j , due to the information being squashed by the aggregation over the exponential number of walks W(i). A similar analysis has been conducted on how a node feature h(T ) i is insensitive to the far-away initial node feature h(0) j = xj, i.e., the over-squashing phenomenon of GNNs (Topping et al., 2022). uc8Uoir Eh Cpubr Xpk Ch C0USUMy G4y+vkmal7F6Wq/e VQu1m EUc WTu AUSu DCFd Tg Dur QAo Cnu EV3qx H68V6tz7mr Rlr MXMf2B9/g ASn Y6ws(t + 4) s(t) = s(t + 2) Figure 2: Two non-backtracking walks (right) are sufficient to express information contained in a walk with backtracking transition (left). Redundancy of walks with backtracking. In particular, a walk s randomly sampled from W(i) is likely to contain a transition that backtracks, i.e., s(t) = s(t+2) for some t < T. The walk s would be redundant since the information is contained in two other walks in W(i): s(0), . . . , s(t+1) and s(0), . . . , s(t)s(t+1), s(t+ 2) = s(t), s(t+3), . . . s(T), as illustrated in Figure 2. This leads to the conclusion that nonbacktracking walks, i.e., walks that do not contain backtracking transitions, are sufficient to express the information in the walks W(i). Since the exponential number of walks in W(i) causes the GNN to be insensitive to a particular walk information, it makes sense to design a non-backtracking GNN that is sensitive to the constrained set of non-backtracking walks. Relation to Over-squashing. Finally, we point out an intriguing motivation for our work in terms of over-squashing. Di Giovanni et al. (2023) analyzed the lower bound for the Jacobian obstruction that measures the degree of over-squashing in terms of access time with respect to a simple random walk. They concluded that the degree of over-squashing, i.e., the size of Jacobian obstruction, is higher for a pair of nodes with longer access time. Hence, for a GNN architecture robust to over-squashing, one could (i) propose a random walk that has shorter access time for a pair of nodes in the graph, and (ii) design a GNN that aligns with the random walk. Since non-backtracking random walks have been empirically shown and believed to generally yield faster access time than simple random walks (Lin & Zhang, 2019; Fasino et al., 2023), one could aim to design a GNN architecture that aligns with the non-backtracking random walks. Access Time of Random Walks. However, to the best of our knowledge, there is no formal proof of scenarios where non-backtracking random walks yield a shorter access time. As a motivating example, we provide a theoretical result comparing the access time of non-backtracking and simple random walks for tree graphs. Since non-backtracking random walks do not guarantee a walk of a certain length, we make use of begrudgingly backtracking random walks (Rappaport et al., 2017), which modifies non-backtracking random walks to remove dead ends for tree graphs. For the full proof, please refer to Appendix B. Proposition 1. Given a tree G = (V, E) and a pair of nodes i, j V, the access time of begrudgingly backtracking random walk is equal to or smaller than that of a simple random walk. The equality holds if and only if the walk length is 1. 3.2 Method Description In this section, we present the Non-BAcktracking GNN (NBA-GNN) with the motivation described in Section 3.1. Given an undirected graph G = (V, E), our NBA-GNN associates a pair of hidden features Published in Transactions on Machine Learning Research (09/2024) 2 2 2 4 ... (a) Computation graph of typical GNN (b) Computation graph of NBA-GNN Figure 3: Computation graph of typical GNN and NBA-GNN predicting node 0 . (a) Redundant messages increase the size of the computation graph, proportional to the number of layers. (b) NBA-GNN assigns a pair of features for each edge and updates them via non-backtracking message passing. By reducing redundant messages, it results in a simplified computation graph compared to typical GNNs. h(t) i j, h(t) j i for each edge {i, j}. Then the non-backtracking message passing update for a hidden feature h(t) j i is defined as follows: h(t+1) j i = ϕ(t) h(t) j i, n ψ(t) h(t) k j, h(t) j i : k N(j) \ {i} o , (3) where ϕ(t) and ψ(t) are backbone-specific non-linear update and permutation-invariant aggregation functions at the t-th layer, respectively. For example, ψ(t) and ϕ(t) are multi-layer perceptron and summation over a set for the graph isomorphism network (Xu et al., 2019, GIN), respectively. Given the update in Equation (3), one can observe that the message h(t) i j is never incorporated in the message h(t+1) j i , and hence the update is free from backtracking. Note that line graph neural networks (Chen et al., 2018, LGNN) also applied non-backtracking operators in GNNs. However, it does not address the issue of redundant messages, as it continues to use the adjacency matrix. Our key contribution is the use of the non-backtracking operator specifically to tackle the message redundancy issue. Please refer to Appendix A for a detailed comparison. Initialization and Node-wise Aggregation of Messages. The message at the 0-th layer h(0) i j is initialized by encoding the node features xi, xj, and the edge feature eij using a non-linear function ϕ. After updating hidden features for each edge based on Equation (3), we apply a permutation-invariant pooling over all the messages for graph-wise predictions. Since we use hidden features for each edge, we construct the node-wise predictions at the final T-th layer as follows: hi = σ ρ n h(T ) j i : j N(i) o , ρ n h(T ) i j : j N(i) o , (4) where σ is a non-linear aggregation function with different weights for incoming edges j i and outgoing edges i j, ρ is a non-linear aggregation function invariant to the permutation of nodes in N(i). We provide a computation graph of NBA-GNN in Figure 3b to summarize our algorithm. Begrudgingly Backtracking Update. While the messages from our update are resistant to backtracking, a message h(t) j i may get trapped in node i for the special case when N(i) = {j}. To resolve this issue, we introduce a simple trick coined begrudgingly backtracking update (Rappaport et al., 2017) that updates h(t+1) i j using h(t) j i only when N(i) = {j}. We empirically verify the effectiveness of begrudgingly backtracking updates in Section 5.3. Implementation. To better understand our NBA-GNN, we provide an example of non-backtracking message-passing updates with a GCN backbone (Kipf & Welling, 2017), coined NBA-GCN. The message update at the t-th layer of NBA-GCN can be written as follows: h(t+1) j i = h(t) j i + σGCN 1 |N(j)| 1W(t) X k N(j)\{i} h(t) k j Published in Transactions on Machine Learning Research (09/2024) where σGCN is an element-wise nonlinear function, e.g., rectified linear unit (Agarap, 2018, Re LU), W(t) is the weight matrix, and messages are normalized by their number of neighbors |N(j)| 1. 4 Theoretical Analysis In this section, we provide a theoretical analysis of the proposed NBA-GNN framework. To be specific, we show that (a) our NBA-GNNs improve the upper bound for sensitivity-based measures of GNN oversquashing and (b) NBA-GNNs can detect the underlying structure of SBMs even for very sparse graphs. 4.1 Sensitivity Analysis on Over-squashing While Chen et al. (2018) initially introduced the non-backtracking operator in GNNs, they did not explore its theoretical implications. Hence, we first analyze how NBA-GNNs alleviate the over-squashing issue. A wellknown quantification to assess the over-squashing effect is the sensitivity bound presented in Proposition 2, i.e., the Jacobian of node-wise output for another initial node feature. Note that Topping et al. (2022) assumes the node features and hidden representations as scalars for better understanding. Proposition 2 (Sensitivity bounds). (Topping et al., 2022) Assume an MPNN defined in Equation (1). Let two nodes i, j V with distance T. If ϕ(t) 1 α and ψ(t) 1 β for 0 t < T, then the sensitivity bound can be defined as the following: h(T ) j xi 1 (αβ)T ( b AT )j,i , (6) where b A denotes the degree-normalized adjacency matrix. Over-squashing occurs when the right-hand side of Equation (6) is too small, i.e., the hidden representation of node j becomes insensitive to the initial feature of node i (Topping et al., 2022; Di Giovanni et al., 2023). To address this, we interpret the topology, i.e., the adjacency matrix, of the sensitivity bound in terms of the random walk that a GNN aligns with. In the following, we show that a non-backtracking random walk yields a higher sensitivity bound than simple random walks. For our analysis, we bring the notation of non-backtracking matrix B {0, 1}2|E| 2|E| and the incidence matrix C R2|E| |V| following Chen et al. (2018) to describe the message-passing of NBA-GNNs and node-wise aggregation via linear operation, respectively. To be specific, the backtracking matrix B and the incidence matrix C are defined as follows: B(ℓ k),(j i) = ( 1 if k = j, ℓ = i 0 otherwise , C(k j),i = ( 1 if j = i or k = i 0 otherwise . We also define Dout and Din as the matrices representing out-degree and in-degree of NBA-GNNs, respectively, capturing the count of outgoing and incoming edges for each edge. These are diagonal matrices with (Dout)(j i),(j i) = P ℓ k B(j i),(ℓ k) and (Din)(j i),(j i) = P ℓ k B(ℓ k),(j i). Next, we introduce b B as the normalized non-backtracking matrix augmented with self-loops: b B = (Dout + I) 1 2 (B + I)(Din + I) 1 2 . Finally, we let C as the matrix where C(k j),i = C(k j),i + C(j k),i. Then, one obtains the following sensitivity bound of NBA-GNNs. Lemma 1 (Sensitivity bounds of NBA-GNNs). Consider two nodes i, j V with a random walk distance T given a (T 1)-layer NBA-GNN as described in Equation (3) and Equation (4). Suppose ϕ(t) 1 , σ 1 α, ψ(t) 1 , ρ 1 β, and ϕ 1 γ for 0 t < T. Then the following holds: hj xi 1 (αβ)T γ( C b B(T 1) C)j,i . We provide the proof in Appendix C. Lemma 1 states how the over-squashing effect is controlled by the power of b B. Consequently, one can infer that increasing the upper bound likely alleviates the over-squashing Published in Transactions on Machine Learning Research (09/2024) effect in GNNs (Topping et al., 2022; Black et al., 2023; Gutteridge et al., 2023) by reshaping the graph topology, i.e., the power of the adjacency matrix in the right-hand side of Equation (6). Additionally, the assumptions on nabla bounds are derived considering the Lipschitz constant of the non-linearity function and the maximum entry value across all weight matrices (Di Giovanni et al., 2023). From this motivation, we provide an analysis to support our claim that the NBA-GNNs suffer less from the over-squashing effect due to its larger sensitivity bound. The key point is that the sensitivity bounds align with a random walk, and the non-backtracking random walks result in a larger sensitivity bound. Proposition 3. Consider an MPNN defined as in Equation (1) and a (T 1)-layer NBA-GNN described by Equation (3) and Equation (4). For any pair of nodes i, j V with distance T, the following inequality holds between sensitivity bounds: ( b AT )j,i ( C b BT 1 C)j,i . For d-regular graphs, ( C b BT 1 C)j,i decays slower by O(d T ), while ( b AT )j,i decays with O((d + 1) T ). We provide the full proof in Appendix C, based on comparing the degree-normalized number of nonbacktracking and simple walks from node i to node j. To the best of our knowledge, we are the first to compare the degree of over-squashing between GNNs aligned with different types of random walks. Hence, Proposition 3 indicates that NBA-GNNs has a larger sensitivity bound compared to conventional GNNs and suffers less from over-squashing, experimentally shown in Table 1 and Figures 4a and 4b. Moreover, for the sensitivity bound of d-regular graphs, consider the case of multiplying the power of b B or b A to a one-hot vector. Since every entry is always identical and smaller in b B, all entries from the resulting vector from the non-backtracking matrix will have larger values than those from the adjacency matrix. 4.2 Expressive Power of NBA-GNN on SBMs In the literature on the expressive capabilities of GNNs, comparisons with the well-known k-WL test are common. However, since the k-WL test only focuses on graph isomorphism, i.e. graph level tasks, it is inadequate for measuring the expressive power in node classification tasks. Furthermore, due to the substantial performance gap between the 1-WL (equivalent to 2-WL) and 3-WL tests, many algorithms fall into the range between these two tests, making it more difficult to compare them with each other (Huang & Villar, 2021; Wang et al., 2023). It is also worth noting that comparing GNNs with the WL test does not always accurately reflect their performance on real-world datasets. To address these issues, several studies have turned to spectral analysis of GNNs. From a spectral viewpoint, GNNs can be seen as functions of the eigenvectors and eigenvalues of the given graph. NT & Maehara (2019) showed that GNNs operate as low-pass filters on the graph spectrum, and Balcilar et al. (2020) analyzed the use of various GNNs as filters to extract the relevant graph spectrum and measure their expressive power. Moreover, Oono & Suzuki (2020) argue that the expressive power of GNNs is influenced by the topological information contained in the graph spectrum. The eigenvalues and the corresponding adjacency matrix eigenvectors play a pivotal role in establishing the fundamental limits of community detection in SBM, as evidenced by Yun & Proutière (2019). The adjacency matrix exhibits a spectral separation property, and an eigenvector containing information about the assignments of the vertex community becomes apparent (Lei & Rinaldo, 2015). Furthermore, by analyzing the eigenvalues of the adjacency matrix, it is feasible to determine whether a graph originates from the Erdős Rényi (ER) model or the SBM (Erdős et al., 2013; Avrachenkov et al., 2015). However, these spectral properties are particularly salient when the average degree of the graph satisfies Ω(log n). For graphs with average degrees o(log n), vertices with higher degrees predominate, affecting eigenvalues and complicating the discovery of the underlying structure of the graph (Benaych-Georges et al., 2019). In contrast, the non-backtracking matrix exhibits several advantageous properties, even for constant-degree cases. In Stephan & Massoulié (2022), the non-backtracking matrix demonstrates a spectral separation property and establishes the presence of an eigenvector containing information about vertex community assignments, when the average degree only satisfies ω(1) and no(1). Furthermore, Bordenave et al. (2015) have demonstrated that by inspecting the eigenvalues of the non-backtracking matrix, it is possible to discern whether a graph originates from the ER model or the SBM, even when the graph s average degree remains Published in Transactions on Machine Learning Research (09/2024) Table 1: Comparison of conventional MPNNs and GNNs in the long-range graph benchmark, with and without Laplacian positional encoding (Lap PE). We also denote the relative improvement by Impr. Model Peptides-func Peptides-struct Pascal VOC-SP AP Impr. MAE Impr. F1 Impr. GCN 0.5930 0.0023 0.3496 0.0013 0.1268 0.0060 + NBA 0.6951 0.0024 +17% 0.2656 0.0009 +22% 0.2537 0.0054 +100% + NBA+Lap PE 0.7206 0.0028 +22% 0.2472 0.0008 +29% 0.3005 0.0010 +137% GIN 0.5498 0.0079 0.3547 0.0045 0.1265 0.0076 + NBA 0.6961 0.0045 +27% 0.2534 0.0025 +29% 0.3040 0.0119 +140% + NBA+Lap PE 0.7071 0.0067 +29% 0.2424 0.0010 +32% 0.3223 0.0010 +155% Gated GCN 0.5864 0.0077 0.3420 0.0013 0.2873 0.0219 + NBA 0.6429 0.0062 +10% 0.2539 0.0011 +26% 0.3910 0.0010 +36% + NBA+Lap PE 0.6982 0.0014 +19% 0.2466 0.0012 +28% 0.3969 0.0027 +38% constant. This capability enhances NBA-GNN s performance in both node and graph classification tasks, especially in sparse settings. These lines of reasoning lead to the formulation of the following propositions. Proposition 4. (Informal) Assume the average degree in the stochastic block model satisfies the conditions of being at least ω(1) and no(1). In such a scenario, NBA-GNN can map from graph G to node labels. Proposition 5. (Informal) Suppose we have a pair of graphs with a constant average degree, one generated from the stochastic block model and the other from the Erdős Rényi model. In this scenario, NBA-GNN is capable of distinguishing between them. Proposition 4 suggests that even if a given graph is too sparse to extract node class information using a GNN with the adjacency matrix, NBA-GNN can still successfully classify the nodes with a probability approaching 1. Similarly, Proposition 5 extends this argument to the graph classification problem. The rationale behind these valuable properties of NBA-GNNs in sparse scenarios lies in the fact that the nonbacktracking matrix Bk exhibits similarity to the k-hop adjacency matrix, while Ak is mainly influenced by high-degree vertices. This enables NBA-GNNs to extract valuable information from the spectrum of the non-backtracking matrix, aiding in the recovery of the hidden structure of the graph. For these reasons, NBA-GNNs would outperform traditional GNNs in both node and graph classification tasks, particularly in sparse graph environments. These propositions integrate prior work on the non-backtracking matrix into GNNs, contributing to a deeper understanding of the expressive power within GNN structures. Such an approach has the potential to advance the field by introducing new perspectives and methodologies within the GNN framework. For an in-depth exploration of this argument, please refer to Appendix D. 5 Experiment In this section, we assess the effectiveness of NBA-GNNs across multiple benchmarks on graph classification, graph regression, and node classification tasks1. Additionally, we conduct a detailed comparison, including a complexity analysis, with existing methods aiming to reduce redundancy (see Appendix A). Detailed experimental information is also provided in Appendix E. 5.1 Long-Range Graph Benchmark The long-range graph benchmark (Dwivedi et al., 2022, LRGB) considers a set of tasks that require learning long-range interactions. We validate our method using three datasets from the LRGB benchmark: Peptides-func (graph classification), Peptides-struct (graph regression), and Pascal VOC-SP (node classification). We adopt performance scores from Dwivedi et al. (2022) for GNNs and from each baseline paper: 1 The code is available at https://github.com/seonghyun26/nba-gnn Published in Transactions on Machine Learning Research (09/2024) Table 2: Evaluation of NBA-GNN on LRGB. The first-, secondand third-best results are colored. Scores within a standard deviation of one another is considered equal. Non-reported values are denoted by -. Method Model Peptides-func Peptides-struct Pascal VOC-SP AP MAE F1 GCN 0.5930 0.0023 0.3496 0.0013 0.1268 0.0060 GCN+Lap PE 0.6213 0.0060 0.3250 0.0180 0.1370 0.0077 GIN 0.5498 0.0079 0.3547 0.0045 0.1265 0.0076 GIN+Lap PE 0.5877 0.0044 0.3369 0.0026 0.1302 0.0105 Gated GCN 0.5864 0.0077 0.3420 0.0013 0.2873 0.0219 Gated GCN+Lap PE 0.6069 0.0035 0.3357 0.0006 0.2860 0.0085 Subgraph GNNs Mix Hop-GCN 0.6592 0.0036 0.2921 0.0023 0.2506 0.0133 Mix Hop-GCN+Lap PE 0.6843 0.0049 0.2614 0.0023 0.2218 0.0174 Path NN 0.6816 0.0026 0.2545 0.0032 - CIN++ 0.6569 0.0117 0.2523 0.0013 - Transformers Transformer+Lap PE 0.6326 0.0126 0.2529 0.0016 0.2694 0.0098 Graph GPS+Lap PE 0.6535 0.0041 0.2500 0.0005 0.3748 0.0109 SAN+Lap PE 0.6384 0.0121 0.2683 0.0043 0.3230 0.0039 Exphormer 0.6527 0.0043 0.2481 0.0007 0.3966 0.0027 Graph MLP-Mixer/Vi T 0.6970 0.0080 0.2449 0.0016 - Graph Rewiring DIGL+MPNN 0.6469 0.0019 0.3173 0.0007 0.2824 0.0039 DIGL+MPNN+Lap PE 0.6830 0.0026 0.2616 0.0018 0.2921 0.0038 DRew-GCN+Lap PE 0.7150 0.0044 0.2536 0.0015 0.1851 0.0092 DRew-GIN+Lap PE 0.7126 0.0045 0.2606 0.0014 0.2692 0.0059 DRew-Gated GCN+Lap PE 0.6977 0.0026 0.2539 0.0007 0.3314 0.0024 State Space Models Graph-Mamba 0.6739 0.0087 0.2478 0.0016 0.4191 0.0126 NBA-GNNs (Ours) NBA-GCN 0.6951 0.0024 0.2656 0.0009 0.2537 0.0054 NBA-GCN+Lap PE 0.7207 0.0028 0.2472 0.0008 0.3005 0.0010 NBA-GIN 0.6961 0.0045 0.2775 0.0057 0.3040 0.0119 NBA-GIN+Lap PE 0.7071 0.0067 0.2424 0.0010 0.3223 0.0063 NBA-Gated GCN 0.6429 0.0062 0.2539 0.0011 0.3910 0.0010 NBA-Gated GCN+Lap PE 0.6982 0.0014 0.2466 0.0012 0.3969 0.0027 subgraph based GNNs (Abu-El-Haija et al., 2019; Michel et al., 2023; Giusti et al., 2023), graph Transformers (Kreuzer et al., 2021; Rampášek et al., 2022; Shirzad et al., 2023; He et al., 2023), graph rewiring methods (Gasteiger et al., 2019; Gutteridge et al., 2023), and state space model (Wang et al., 2024). For NBA-GNNs and NBA-GNNs with begrudgingly backtracking, we report the one with better performance. Furthermore, Lap PE, i.e., Laplacian positional encoding (Dwivedi et al., 2023), is applied as it enhances the performance of NBA-GNNs in common cases. As one can see in Table 1, NBA-GNNs show improvement regardless of the combined backbone GNNs, i.e., GCN (Kipf & Welling, 2017), GIN (Xu et al., 2019), and Gated GCN (Bresson & Laurent, 2018), aligning with our results in Proposition 3. Specifically, NBA-GCN outperforms Gated GCN, (i) confirming that simply updating both node and edge features does not lead to performance improvement and (ii) showing that nonbacktracking is a key component for solving long-range interaction tasks. Furthermore, when compared to a variety of recent baselines in Table 2, at least one NBA-GNN shows competitive performance with the best baseline, except for the state space model for Pascal VOC-SP. It is also noteworthy that the improvement of NBA-GNNs is higher in dense graphs, where Pascal VOC-SP has an average degree of 8 while Peptides-func and Peptides-struct have an average degree of 2. 5.2 Transductive Node Classification Tasks To validate the effectiveness of non-backtracking in transductive node classification tasks, we conduct experiments on three citation networks (Cora, Cite Seer, and Pubmed) (Sen et al., 2008) and three heterophilic Published in Transactions on Machine Learning Research (09/2024) Table 3: Comparison of GNNs and their NBA-GNN counterpart on transductive node classification tasks for citation networks and heterophilic datasets, with and without Laplacian positional encoding (Lap PE). We mark the best numbers in bold. Model Cora Cite Seer Pub Med Texas Wisconsin Cornell GCN 0.8658 0.0060 0.7532 0.0134 0.8825 0.0042 0.6162 0.0634 0.6059 0.0438 0.5946 0.0662 + Lap PE 0.8592 0.0083 0.7572 0.0132 0.8817 0.0040 0.6216 0.0584 0.6000 0.0600 0.5703 0.0547 + NBA 0.8722 0.0095 0.7585 0.0175 0.8826 0.0044 0.7108 0.0796 0.7471 0.0386 0.6108 0.0614 + NBA+Lap PE 0.8720 0.0129 0.7609 0.0186 0.8827 0.0048 0.6811 0.0595 0.7471 0.0466 0.6378 0.0317 Graph SAGE 0.8632 0.0158 0.7559 0.0161 0.8864 0.0030 0.7108 0.0556 0.7706 0.0403 0.6027 0.0625 + Lap PE 0.8700 0.0117 0.7608 0.0144 0.8895 0.0047 0.7162 0.0653 0.7647 0.0453 0.6189 0.0484 + NBA 0.8702 0.0083 0.7586 0.0213 0.8871 0.0044 0.7270 0.0905 0.7765 0.0508 0.6459 0.0691 + NBA+Lap PE 0.8650 0.0120 0.7621 0.0172 0.8870 0.0037 0.7486 0.0612 0.7647 0.0531 0.6378 0.0544 GAT 0.8694 0.0119 0.7463 0.0159 0.8787 0.0046 0.6054 0.0386 0.6000 0.0491 0.4757 0.0614 + Lap PE 0.8686 0.0152 0.7512 0.0154 0.8775 0.0045 0.6135 0.0404 0.6294 0.0448 0.5108 0.0769 + NBA 0.8722 0.0120 0.7549 0.0171 0.8829 0.0043 0.6622 0.0514 0.7059 0.0562 0.5838 0.0558 + NBA+Lap PE 0.8692 0.0098 0.7561 0.0175 0.8822 0.0047 0.6730 0.0348 0.7314 0.0531 0.5784 0.0640 Cheb Net 0.8523 0.0110 0.7399 0.0160 0.8718 0.0029 0.6811 0.0554 0.7098 0.0322 0.6473 0.0520 + Lap PE 0.8531 0.0139 0.7396 0.0241 0.8701 0.0047 0.6324 0.0308 0.6941 0.0905 0.6527 0.0506 + NBA 0.8823 0.0159 0.7379 0.0099 0.8832 0.0062 0.7568 0.0468 0.7490 0.0671 0.6973 0.0483 + NBA+Lap PE 0.8795 0.0217 0.7421 0.0137 0.8821 0.0076 0.7243 0.0520 0.7451 0.0808 0.6757 0.0634 Table 4: Comparison of NBA-GNN architectures and edge representation updating GNN architectures on transductive node classification tasks. We mark the best numbers in bold. Model Cora Cite Seer Pub Med Texas Wisconsin Cornell NBA-GCN 0.8722 0.0095 0.7585 0.0175 0.8826 0.0044 0.7108 0.0796 0.7471 0.0386 0.6108 0.0614 NBA-Graph SAGE 0.8702 0.0083 0.7586 0.0213 0.8871 0.0044 0.7270 0.0905 0.7765 0.0508 0.6459 0.0691 NBA-GAT 0.8722 0.0120 0.7549 0.0171 0.8829 0.0043 0.6622 0.0514 0.7059 0.0562 0.5838 0.0558 Gated GCN 0.8477 0.0156 0.7325 0.0192 0.8671 0.0060 0.6108 0.0652 0.5824 0.0641 0.5216 0.0987 EGNN 0.8769 0.0125 0.7567 0.0221 0.8769 0.0028 0.6595 0.0527 0.6784 0.0407 0.5946 0.0573 Cens Net 0.8648 0.0138 0.7516 0.0162 0.8753 0.0076 0.6405 0.0510 0.6608 0.0463 0.6162 0.0707 datasets (Texas, Wisconsin, and Cornell) (Pei et al., 2019). We use three conventional GNN architectures - GCN (Kipf & Welling, 2017), Graph SAGE (Hamilton et al., 2017), and GAT (Veličković et al., 2018) - and one spectral GNN architecture, Cheb Net (Defferrard et al., 2016), as the backbone of NBA-GNN in Table 3. The results indicate that the non-backtracking update improves the performance of all GNN variants. Furthermore, Table 3 shows significant enhancements in heterophilic datasets. Given that related nodes in heterophilic graphs are often widely separated (Zheng et al., 2022), the ability of NBA-GNN to alleviate over-squashing plays a vital role in classifying nodes in such scenarios. 5.3 Ablation Studies In this section, we conduct ablation studies to empirically verify our framework. For simplicity, we use BA for backtracking GNNs and BG for begrudgingly backtracking GNNs. All experiments are averaged over 3 seeds, and hyper-parameters for GCN from Tönshoff et al. (2023) are used for a fair comparison. Non-backtracking vs. Backtracking. We first verify whether the performance improvements indeed stem from the non-backtracking updates. To this end, we compare our NBA-GNN with a backtracking variant, coined BA-GNN. To be clear, BA-GNN allows backtracking update prohibited in NBA-GNN, i.e., using h(ℓ) i j to update h(ℓ+1) j i in Equation (3). From Figures 4a and 4b, NBA-GCN consistently outperforms the BA-GCN and GCN regardless of the number of layers, aligning with the results of Proposition 3. Intriguingly, one can also observe that BA-GCN outperforms the naïve backbone, i.e., GCN, consistently. Published in Transactions on Machine Learning Research (09/2024) 6 8 10 12 Number of layers NBA-GCN BA-GCN GCN (a) F1 for changes in number of layers in Pascal VOC-SP 5 10 15 20 Number of layers NBA-GCN BA-GCN GCN (b) AP for changes in number of layers in Peptides-func Model BG AP MAE GCN 0.7015 0.2472 0.7207 0.2547 GIN 0.6825 0.2424 0.7071 0.2479 Gated GCN 0.6710 0.2466 0.6982 0.2489 (c) Performance of begrudgingly, nonbacktracking in sparse graphs Figure 4: Ablation studies on the components of NBA-GNN. Table 5: Training time and memory usage on the large datasets for the transductive node classification task. Datasets Model Time (s) # Param. (K) Memory (Mi B) Model Time (s) # Param. (K) Memory (Mi B) GCN 12.22 2317 24.01 GCN+NBA 38.95 3051 158.86 SAGE 10.74 3104 27.01 SAGE+NBA 47.07 3837 161.86 GAT 17.72 3130 27.21 GAT+NBA 58.20 3864 162.06 GCN 15.74 3479 60.68 GCN+NBA 52.44 5375 373.36 SAGE 14.95 4265 63.68 SAGE+NBA 59.94 6161 376.36 GAT 19.68 4292 63.87 GAT+NBA 66.92 6188 376.56 GCN 42.66 1838 47.9 GCN+NBA 242.66 2094 436.9 SAGE 56.10 2624 50.9 SAGE+NBA 320.92 2880 439.9 GAT 64.80 2650 51.0 GAT+NBA 416.85 2906 440.0 Begrudgingly Backtracking Updates in Sparse Graphs. Additionally in Figure 4c, we investigate the effect of begrudgingly backtracking in sparse graphs, i.e., Peptides-func, and Peptides-struct. One can see that begrudgingly backtracking is effective in Peptides-func, and shows similar performance in Peptides-struct (Note that the Pascal VOC-SP does not have a vertex with degree one). Non-backtracking & Edge Features Updates. From another point of view, NBA-GNN can be interpreted as an architecture only updating edge-wise features using backbone GNN layers. Therefore, we conduct additional experiments to validate that the performance improvement truly comes from nonbacktracking rather than updating edge features in Table 3. We consider three GNN architectures that also update edge features: Gated GCN (Bresson & Laurent, 2018), EGNN (Gong & Cheng, 2019), and Cens Net (Jiang et al., 2020). As seen in Table 4, NBA-GNN outperforms these models on most datasets, showing the effectiveness of non-backtracking over edge feature updates. It is noteworthy that NBA-GNN only updates edge features derived from initial node features, while EGNN and Cens Net update both node and edge features from initial node and edge features. We report detailed implementation in Appendix E.3.3. 5.4 Complexity analysis In this section, we analyze the space and time complexity of some baselines and NBA-GNN. All experiments were conducted on a single RTX 3090. Space Complexity. NBA-GNNs generate messages for each edge considering directions and pass these messages in a non-backtracking matter. This process requires 2|E| messages, and (davg 1)|E| connections among messages where davg is the average degree of the graph. Although this may seem substantial, it has not been a bottleneck in practice and can be mitigated by adjusting the batch size. Moreover, this is a relatively less computation compared to DRew, which requires computation over k-hop neighbors (with k being the number of layers). We report the memory usage for the transductive node classification task in Table 5 and the LRGB task in Table 6 (DRew and Path NN could not fit into a single GPU due to Published in Transactions on Machine Learning Research (09/2024) Table 6: Average test time (s) per epoch and memory usage for Peptides-func, Peptides-struct, Pascal VOC-SP dataset. Parenthesis refers to the batch size, and the best performance for each dataset is highlighted in bold. Metric Dataset Path NN-SP Path NN-SP+ Path NN-AP Graph GPS+Lap PE DRew-GCN+Lap PE NBA-GCN+Lap PE Memory usage Peptides-func OOM (128) OOM (128) OOM (128) 89.70% (128) 46.82% (128) 18.86% (128) Peptides-struct OOM (128) OOM (128) OOM (128) 88.08% (128) 46.82% (128) 16.98% (128) Pascal VOC-SP N/A N/A N/A 45.64% (32) OOM (32) 47.30% (32) Test time (s) Peptides-func 3.349 (64) 5.156 (64) 5.977 (64) 0.639 (128) 0.933 (128) 0.541 (200) Peptides-struct 2.120 (64) 1.417 (64) 1.372 (64) 0.926 (128) 1.053 (128) 0.532 (128) Pascal VOC-SP N/A N/A N/A 1.307 (32) 6.468 (20) 5.866 (30) Table 7: Comparison on theoretical and practical time complexity for preprocessing Peptides-func dataset. Model RFGNN Path NN-SP DRew NBA-GNN Theoretical time complexity |V|!/|V k 1|! |V| bk Pk i=1 i |E| davg |E| Practical computation time (s) N/A 666 123 68 memory overflow for Pascal VOC-SP). NBA-GNN shows less memory usage compared to graph Transformer architectures, although it consumes more memory than MPNNs. Time Complexity. We also report the average test time per epoch for LRGB in Table 6. Every experiment has been conducted with the largest batch size that fits the GPU. NBA-GNN shows competitive time compared to other baselines, though it does suffer in graphs with a high average degree. Preprocessing Time Complexity. Finally, we investigate the time complexity of preprocessing in RFGNN (Chen et al., 2022), Path NN-SP (Michel et al., 2023), DRew (Gutteridge et al., 2023), and NBAGNN on the Peptides-func dataset. To be specific, RFGNN requires k-depth non-redundant tree, Path NNSP needs the single shortest path between nodes, and DRew demands the k-hop neighbors information. NBA-GNN requires the computation of the non-backtracking edge adjacency, which can be computed in O(|E|2), even O(davg|E|) if the data is provided in the form of an adjacency list. We denote b as the branching factor, k as the number of layers, and davg as the average degree of nodes in a graph. Inevitably, previous works are dependent on the number of layers, while NBA-GNN remains irrelevant to the number of layers. In Table 7, one can see that NBA-GNN shows superiority in both theoretical time complexity and practical computation time (RFGNN codes were not reported by the authors). 6 Conclusion We have introduced a message-passing framework applicable to any GNN architectures to alleviate oversquashing. As theoretically shown, NBA-GNNs mitigate over-squashing in terms of sensitivity and enhance their expressive power for both node and graph classification tasks on SBMs. Additionally, we have demonstrated that NBA-GNNs achieve competitive performance on the LRGB benchmark, and show improvements over conventional GNNs across transductive node classification tasks, even in heterophilic datasets. Acknowledgements S. Park and S. Ahn were supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2019-II191906, Artificial Intelligence Graduate School Program(POSTECH)), the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1C1C1013366), and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education(2022R1A6A1A0305295413). N. Ryu and S. Yun were supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) Published in Transactions on Machine Learning Research (09/2024) (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST), 10%) and the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration, 90%). We thank Soo Yong Lee, Seongsu Kim, Hyosoon Jang, Yunhui Jang, Juwon Hwang, Hyomin Kim, for their valuable comments and suggestions in preparing the early version of the manuscript. Bibliography Emmanuel Abbe. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446 6531, 2017. Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In international conference on machine learning, pp. 21 29. PMLR, 2019. Abien Fred Agarap. Deep learning using rectified linear units (relu). ar Xiv preprint ar Xiv:1803.08375, 2018. Noga Alon, Itai Benjamini, Eyal Lubetzky, and Sasha Sodin. Non-backtracking random walks mix faster. Communications in Contemporary Mathematics, 9(04):585 603, 2007. Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2021. Konstantin Avrachenkov, Laura Cottatellucci, and Arun Kadavankandy. Spectral properties of random matrices for stochastic block model. In 2015 13th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (Wi Opt), pp. 537 544. IEEE, 2015. Furqan Aziz, Richard C Wilson, and Edwin R Hancock. Backtrackless walks on a graph. IEEE transactions on neural networks and learning systems, 24(6):977 989, 2013. Muhammet Balcilar, Guillaume Renton, Pierre Héroux, Benoit Gaüzère, Sébastien Adam, and Paul Honeine. Analyzing the expressive power of graph neural networks in a spectral perspective. In International Conference on Learning Representations, 2020. Aseem Baranwal, Kimon Fountoulakis, and Aukosh Jagannath. Effects of graph convolutions in multi-layer networks. In The Eleventh International Conference on Learning Representations, 2023. Florent Benaych-Georges, Charles Bordenave, and Antti Knowles. Largest eigenvalues of sparse inhomogeneous erdős rényi graphs. The Annals of Probability, 47(3):1653 1676, 2019. Mitchell Black, Zhengchao Wan, Amir Nayyeri, and Yusu Wang. Understanding oversquashing in gnns through the lens of effective resistance. In International Conference on Machine Learning, pp. 2528 2547. PMLR, 2023. Charles Bordenave, Marc Lelarge, and Laurent Massoulié. Non-backtracking spectrum of random graphs: community detection and non-regular ramanujan graphs. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 1347 1357. IEEE, 2015. Xavier Bresson and Thomas Laurent. Residual gated graph convnets. In International Conference on Learning Representations, 2018. Joan Bruna and X Li. Community detection with graph neural networks. stat, 1050:27, 2017. Rongqin Chen, Shenghui Zhang, Ye Li, et al. Redundancy-free message passing for graph neural networks. Advances in Neural Information Processing Systems, 35:4316 4327, 2022. Zhengdao Chen, Lisha Li, and Joan Bruna. Supervised community detection with line graph neural networks. In International Conference on Learning Representations, 2018. Published in Transactions on Machine Learning Research (09/2024) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29, 2016. Francesco Di Giovanni, Lorenzo Giusti, Federico Barbero, Giulia Luise, Pietro Lio, and Michael M Bronstein. On over-squashing in message passing neural networks: The impact of width, depth, and topology. In International Conference on Machine Learning, pp. 7865 7885. PMLR, 2023. Vijay Prakash Dwivedi, Ladislav Rampášek, Michael Galkin, Ali Parviz, Guy Wolf, Anh Tuan Luu, and Dominique Beaini. Long range graph benchmark. Advances in Neural Information Processing Systems, 35:22326 22340, 2022. Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1 48, 2023. László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Spectral statistics of erdős rényi graphs i: Local semicircle law. The Annals of Probability, pp. 2279 2375, 2013. Dario Fasino, Arianna Tonetto, and Francesco Tudisco. Hitting times for second-order random walks. European Journal of Applied Mathematics, 34(4):642 666, 2023. Robert Fitzner and Remco van der Hofstad. Non-backtracking random walk. Journal of Statistical Physics, 150:264 284, 2013. Kimon Fountoulakis, Amit Levi, Shenghao Yang, Aseem Baranwal, and Aukosh Jagannath. Graph attention retrospective. Journal of Machine Learning Research, 24(246):1 52, 2023. Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. Advances in neural information processing systems, 32, 2019. Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263 1272. PMLR, 2017. Lorenzo Giusti, Teodora Reu, Francesco Ceccarelli, Cristian Bodnar, and Pietro Liò. Cin++: Enhancing topological message passing. ar Xiv preprint ar Xiv:2306.03561, 2023. Liyu Gong and Qiang Cheng. Exploiting edge features for graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9211 9219, 2019. Benjamin Gutteridge, Xiaowen Dong, Michael M Bronstein, and Francesco Di Giovanni. Drew: Dynamically rewired message passing with delay. In International Conference on Machine Learning, pp. 12252 12267. PMLR, 2023. Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Neur IPS, 30, 2017. Xiaoxin He, Bryan Hooi, Thomas Laurent, Adam Perold, Yann Le Cun, and Xavier Bresson. A generalization of vit/mlp-mixer to graphs. In International Conference on Machine Learning, pp. 12724 12745. PMLR, 2023. Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109 137, 1983. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118 22133, 2020. Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb-lsc: A large-scale challenge for machine learning on graphs. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. Published in Transactions on Machine Learning Research (09/2024) Ningyuan Teresa Huang and Soledad Villar. A short tutorial on the weisfeiler-lehman test and its variants. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8533 8537. IEEE, 2021. Xiaodong Jiang, Ronghang Zhu, Sheng Li, and Pengsheng Ji. Co-embedding of nodes and edges with graph neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. Charilaos I. Kanatsoulis and Alejandro Ribeiro. Representation power of graph neural networks: Improved expressivity via algebraic analysis, 2023. Mark Condie Kempton. Non-backtracking random walks and a weighted ihara s theorem. Open Journal of Discrete Mathematics, 2016. Jin Hyung Kim and Judea Pearl. A computational model for causal and diagnostic reasoning in inference systems. In International Joint Conference on Artificial Intelligence, pp. 0 0, 1983. Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34: 21618 21629, 2021. Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935 20940, 2013. Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, pp. 215 237, 2015. Yuan Lin and Zhongzhi Zhang. Non-backtracking centrality based random walk on networks. The Computer Journal, 62(1):63 80, 2019. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018. Pierre Mahé, Nobuhisa Ueda, Tatsuya Akutsu, Jean-Luc Perret, and Jean-Philippe Vert. Extensions of marginalized graph kernels. In Proceedings of the twenty-first international conference on Machine learning, pp. 70, 2004. Gaspard Michel, Giannis Nikolentzos, Johannes F Lutzeyer, and Michalis Vazirgiannis. Path neural networks: Expressive and accurate graph neural networks. In International Conference on Machine Learning, pp. 24737 24755. PMLR, 2023. Kevin P Murphy, Yair Weiss, and Michael I Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pp. 467 475, 1999. MEJ Newman. Spectral community detection in sparse networks. ar Xiv preprint ar Xiv:1308.6494, 2013. Hoang NT and Takanori Maehara. Revisiting graph neural networks: All we have is low-pass filters. ar Xiv preprint ar Xiv:1905.09550, 2019. Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2020. Judea Pearl. Reverend bayes on inference engines: A distributed hierarchical approach. In Proceedings of the AAAI conference on artificial intelligence, pp. 133 136, 1982. Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. In International Conference on Learning Representations, 2019. Published in Transactions on Machine Learning Research (09/2024) Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501 14515, 2022. Brian Rappaport, Anuththari Gamage, and Shuchin Aeron. Faster clustering via non-backtracking random walks. stat, 1050:26, 2017. Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93 93, 2008. Hamed Shirzad, Ameya Velingker, Balaji Venkatachalam, Danica J Sutherland, and Ali Kemal Sinop. Exphormer: Sparse transformers for graphs. In International Conference on Machine Learning, 2023. Ludovic Stephan and Laurent Massoulié. Non-backtracking spectra of weighted inhomogeneous random graphs. Mathematical Statistics and Learning, 5(3):201 271, 2022. Jan Tönshoff, Martin Ritzert, Eran Rosenbluth, and Martin Grohe. Where did the gap go? reassessing the long-range graph benchmark. In The Second Learning on Graphs Conference, 2023. Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In International Conference on Learning Representations, 2022. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. Chloe Wang, Oleksii Tsepa, Jun Ma, and Bo Wang. Graph-mamba: Towards long-range graph sequence modeling with selective state spaces. ar Xiv preprint ar Xiv:2402.00789, 2024. Qing Wang, Dillon Ze Chen, Asiri Wijesinghe, Shouheng Li, and Muhammad Farhan. N-wl: A new hierarchy of expressivity for graph neural networks. In The Eleventh International Conference on Learning Representations, 2023. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pp. 40 48. PMLR, 2016. Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877 28888, 2021. Se-Young Yun and Alexandre Proutière. Optimal cluster recovery in the labeled stochastic block model. Advances in Neural Information Processing Systems, 29, 2016. Se-Young Yun and Alexandre Proutière. Optimal sampling and clustering in the stochastic block model. Advances in Neural Information Processing Systems, 32, 2019. Xin Zheng, Yixin Liu, Shirui Pan, Miao Zhang, Di Jin, and Philip S Yu. Graph neural networks for graphs with heterophily: A survey. ar Xiv preprint ar Xiv:2202.07082, 2022. Published in Transactions on Machine Learning Research (09/2024) Appendix 17 A Comparison with Related Works 18 A.1 Line Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 A.2 Redundancy-free Graph Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 A.3 Stochastic Block Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 B Proofs for Section 3.1 20 B.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 B.2 Access Time of Simple Random Walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 B.3 Access Time of Begrudgingly Backtracking Random Walks . . . . . . . . . . . . . . . . . . . 23 B.4 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 C Proofs for Section 4.1 29 C.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 C.2 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 C.3 Proof of Proposition 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 D Proofs for Section 4.2 32 D.1 Proof of Proposition 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 D.2 Proof of Proposition 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 D.3 Proof of Lemma 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 D.4 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 E Experiment Details 37 E.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 E.2 Long-Range Graph Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 E.3 Transductive Node Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Published in Transactions on Machine Learning Research (09/2024) A Comparison with Related Works This section delves into the detailed exploration of key related works, highlighting essential distinctions from our NBA-GNN. In addition, we present simple experiments to prove the superiority of our model. A.1 Line Graph Neural Networks Line Graph Neural Networks (LGNN) (Chen et al., 2018) were the pioneers in applying the non-backtracking operator to GNNs. However, it does not resolve redundancy issues, as it employs both an adjacency matrix and non-backtracking matrix in a layer. In other words, our main contribution over LGNN is in consideration of the non-backtracking operator specifically for the message redundancy problem, in a end-to-end manner. Notably, the computational complexity of LGNN is acknowledged to be close to |V | log(|V |) by its authors. Table 8: Comparison of LGNN and NBA-GCN on Peptides-func. Model Training AP Test AP Test time per epoch (s) GPU usage (%) LGNN 0.4202 0.3778 87.761 97.10 NBA-GCN+Lap PE 0.9724 0.7207 0.541 51.18 In our experiments on Peptides-func, we compare the performance and complexity of LGNN and NBAGCN+Lap PE, as detailed in Table 8. Given that LGNN was initially proposed using only node degrees for node features, we incorporated an additional node feature encoder, similar to our NBA-GNN setup. Specifically, we used a batch size of 64, hidden dimension of 24, and 4 layers for LGNN. Refer to Appendix E.2 for the hyperparameters of NBA-GCN. The drawbacks of LGNN in terms of time and space complexity become evident when compared to NBA-GNN, while requiring space and time for computing both simple random walks and non-backtracking walks, but result in poor performance. A.2 Redundancy-free Graph Neural Network Redundancy-Free Graph Neural Network (RFGNN) (Chen et al., 2022) shares the common motivation with NBA-GNN, aiming to reduce redundant messages in the computation graph. RFGNN achieves this by constructing a tree for every node, coined Truncated e Path Tree (TPT). TPT ensures that there are no repeated nodes along the simple path from the root to the leaf, except for the root node which can appear twice in the path. This approach differs significantly from NBA-GNN, which eliminates redundancy by identifying non-backtracking edge adjacency. In terms of complexity, RFGNN is known to have a space and time complexity of O |V |! |V t 1|! , where |V | is the number of nodes and t is the number of GNN layers. In contrast, NBA-GNN achieves superior complexity with space complexity O(2|E|) and time complexity O(davg|E|), which remains irrelevant to the number of layers. The preprocessing process time complexity, involved in finding non-backtracking edge adjacency, is O(|E|2). This scalability allows NBA-GNN to handle larger graph dataset compared to RFGNN. Furthermore, in theoretical aspects, our work distinguishes itself by providing the sensitivity upper bound of non-backtracking updates, rather than comparing the relative inference of paths. Table 9: Average test accuracy of LGNN, NBA-GNN and BP on two sparse SBMs, where n represents the number of vertices, C is the number of classes, and p and q denote edge probabilities within and across communities, respectively. Model Parameters LGNN NBA-GNN BP (n, C, p, q) (a) Binary assortative SBM (400, 2, 20/n, 10/n) 0.4885 0.4900 0.5303 (b) 5-community dissociative SBM (400, 5, 0, 18/n) 0.1821 0.1888 0.2869 Published in Transactions on Machine Learning Research (09/2024) A.3 Stochastic Block Models We also conduct a comparative analysis for our method on two sparse Stochastic Block Models (SBMs) with distinct parameters for Belief propagation (BP), and LGNN: (a) a binary assortative SBM (n = 400, C = 2, p = 20/n, q = 10/n), and (b) a 5-community dissociative SBM (n = 400, C = 5, p = 0, q = 18/n). To be specific, n denotes the number of vertices, C denotes the number of classes, and p and q denote edge probabilities within and across communities, respectively. Table 9 illustrates the average test accuracy across 100 graphs. Notably, BP, known for achieving the information-theoretic threshold, exhibited the best performance, consistent with expectations. Additionally, NBA-GNN outperforms LGNN in both scenarios. Published in Transactions on Machine Learning Research (09/2024) B Proofs for Section 3.1 In this section, we present the findings discussed in Section 3.1. Similar to Di Giovanni et al. (2023), we concentrate on the relationship between over-squashing and access time of non-backtracking random walks. Our study establishes that the access time using a begrudgingly backtracking random walk (BBRW) is smaller than that of a simple random walk (SRW) between two nodes, in tree graphs. Also, it is noteworthy that the gap between these two access times increases as the length of the walk grows. In the following, we will compare the access times between BBRW and SRW. First, we show that on the tree graph, the access time equals the sum of access times between neighboring nodes. Note that the access time between neighboring nodes, which is the cut-point, can be represented in terms of return time. The formulas for return time in Fasino et al. (2023) and Lemma 6 allow us to derive the formulas for access time between neighboring nodes. Finally, we derive and compare the access time of BBRW and SRW, i.e., Proposition 1. Proposition 1. Given a tree G = (V, E) and a pair of nodes i, j V, the access time of begrudgingly backtracking random walk is equal to or smaller than that of a simple random walk. The equality holds if and only if the walk length is 1. B.1 Preliminaries B.1.1 Simple and Begrudgingly Random Walks A random walk on a graph G = (V, E) is a sequence of V-valued random variable x0, x1, x2 . . . where xn+1 is chosen randomly from neighborhood of xn. Different types of random walks have different probabilities for selecting neighboring nodes. Simple random walk (SRW) choose a next node j uniformly from the neighbors of current node i: P(xn+1 = j|xn = i) = ( 1 di if (i, j) E 0 otherwise . On the other hand, begrudgingly backtracking random walk (BBRW) tries to avoid the previous node when there is another option: P(xn+2 = k|xn+1 = j, xn = i) = 1 dj 1 when (j, k) E, k = i, and |N(j) \ {i}| 1 1 when (j, k) E and |N(j) \ {i}| = 0 0 otherwise B.1.2 Access Time Consider a SRW starting at a node i V. Let Ti denote the time when a SRW first arrived at node i, Ti := min{n 0|xn = i}, and T i be the time when a random walk first arrived at node i after the first step, T i := min{n > 0|xn = i}. With slight abuse of notation, we define access time t(i, j) from i to j, access time t(i j, k) from i to k where x1 = j, and return time t(i; G) in a graph G as follows: t(i, j) := E[Tj|x0 = i] t(i j, k) := E[Tk|x1 = j, x0 = i] t(i; G) := E[T i|x0 = i] Similarly, we denote the access time from i to j, access time from i to k with where x1 = j and return time of BBRW as t(i, j), t(i j, k) and t(i; G), respectively. Note that the first step of BBRW shows the same behavior as the SRW since there is no previous node on the first step. Finally, for a tree graph G and pair of nodes i, j in Proposition 1, we denote the unique paths between i and j as (v0, . . . , v N) with i = v0, j = v N. We also let N denote the distance between the nodes i and j. Published in Transactions on Machine Learning Research (09/2024) B.2 Access Time of Simple Random Walks In this section, we derive the access time of simple random walks (SRW) on tree graphs between two nodes i and j, i.e., t(i, j). To be specific, we show this in the following process. 1. We decompose the access time for two nodes i and j into a summation of the access time of neighboring nodes in the path, i.e., t(vn, vn+1) for n 0, , N 1 (Appendix B.2.1 and Lemma 2). 2. We evaluate the access time of neighboring nodes in the path, e.g., t(vn, vn+1), using the number of edges in a graph (Appendix B.2.2 and Lemma 3). 3. We formulate the access time of simple random walks (SRW) between two nodes i and j, i.e., t(i, j) (Appendix B.2.3 and Proposition 6). B.2.1 Decomposition of Access Time First, we show that the access time is equal to the sum of access times between neighboring nodes. When a random walker travels from v0 to v N, it must pass through all nodes vn on the paths. Intuitively, We can consider the entire time taken as the summation of the time intervals between when the walker arrived at vn and when it arrived at vn+1. Lemma 2. Given a tree G and path (v0, . . . , v N), t(v0, v N) = n=0 t(vn, vn+1) . Proof. Given a unique path (v0, . . . , v N) between v0 and v N, any walk between (v0, v N) can be decomposed into a series of N 1 walks between (v0, v1), (v1, v2), . . . (v N 1, v N) such that the walk between (vn, vn+1) does not contain a node vn+1 except at the end point. The expected length of each walk is t(vn, vn+1). To be specific, consider the following decomposition: t(v0, v N) = E[Tv N |x0 = v0] = n=0 E[Tvn+1 Tvn|x0 = v0] . (7) From the Markov property of SRW: E[Tvn+1 Tvn|x0 = v0] = t(vn, vn+1) . Finally, we can formulate the access time as the following: t(v0, v N) = n=0 E[Tvn+1 Tvn|x0 = v0] = n=0 t(vn, vn+1) . B.2.2 Access Time between Neighbors Now, we derive the formula for the access time between neighboring nodes. Lemma 3. Given a tree graph G and adjacent nodes i, j, the associated access time t(i, j) is defined as follows: t(i, j) = 1 + 2|E(Gi)| , where E(G) is edge set of graph G and Gi is the subtree produced by deleting edge (i, j) and choosing connected component of i. Published in Transactions on Machine Learning Research (09/2024) Proof. Given a random walk from i to j that only contains j once, every time the walk lands at the node i, the next step can be categorized into two scenarios: 1. The walk transitions from node i to j based on transitioning with respect to the edge (i, j) E with probability 1 di . The walk terminates. 2. The walk fails to reach j and continues the walk in the subtree Gi until arriving at the node i again. Note that the walk cannot arrive at node j without arriving at node i in prior. In other words, the walk continues for the return time of i concerning the graph Gi. The two scenarios imply that every time the walk arrives at node i, the walk terminates with probability 1 di . Then the number of trials follows the geometric distribution and consequently, the average number of trials is di. In other words, the walk falls into the scenario of type 2, for di 1 times on average. We have to traverse at least one edge (i, j). The expected total penalty is the product of the average failure penalty and the average number of failures. Thus, t(i, j) = 1 + (di 1)t(i; Gi) , (8) where t(i; Gi) is the return time of i with respect to the subgraph Gi. Since the return time is the ratio of the sum of the degree to the degree of the node, as stated by Fasino et al. (2023), the following equation holds: t(i; Gi) = 2|E(Gi)| which directly implies our conclusion of t(i, j) = 1 + 2|E(Gi)|. B.2.3 Main Result Using Lemma 2 and Lemma 3, we arrive at our main result for the access time of SRW on trees. Proposition 6. Given a tree G and pair of nodes i, j V, the following equations hold for the access time of SRW between u and v. n=0 (1 + 2|E(Gn)|) , where E(G) is edge set of graph G and Gn is the subtree produced by deleting edge (vn, vn+1) and choosing connected component of vn. Proof. The proof is a straightforward combination of Lemma 2 and Lemma 3. n=0 t(vn, vn+1) = n=0 (1 + 2|E(Gn)|) . Published in Transactions on Machine Learning Research (09/2024) B.3 Access Time of Begrudgingly Backtracking Random Walks In this section, we derive the access time of begrudgingly backtracking random walks BBRW) between two nodes i and j, i.e., t(i, j), on tree graphs. At a high level, this section consists of the following order. 1. We decompose the access time for two nodes v0 and v N into summations of the access time of neighboring nodes in the path (Appendix B.3.1 and Lemma 4). 2. The decomposed term in 1. can be formulated using (i) the return time and (ii) the access time between neighboring nodes (Appendix B.3.2 and Lemma 5). 3. The return time of a node can be formulated in terms of the number of edges and the degree of a node (Appendix B.3.3 and Lemma 6). 4. The access time between neighboring nodes can be formulated using the number of edges and the degree of the node (Appendix B.3.4 and Lemma 7). 5. Finally we formulate the access time of begrudgingly backtracking random walks (BBRW), i.e., t(i, j) (Appendix B.3.5 and Proposition 7). B.3.1 Decomposition of Access Time We start by decomposing the access time t(i, j) similar to the one in Lemma 2. However, a key difference exists in the BBRW random walk. When the walk first arrives at node vn, it previously passed the node vn 1. Therefore, we cannot return to vn 1 upon the first failure to reach vn+1. Lemma 4. Consider a tree graph G and path (v0, . . . , v N). Then the access time of BBRW between node v0 and v N can be derived as follows: t(v0, v N) = t(v0 v1, v0) + t(vn 1 vn, vn+1) 1 Proof. The proof is similar to Lemma 2 such that we decompose the walk into a series of N 1 walks between (v0, v1), (v1, v2), . . . , (v N 1, v N) such that the walk between vn, vn+1 does not contain a node vn+1 except at the endpoint. The expected length of each walk is t(vn 1 vn, vn+1) 1. Note that t(vn 1 vn, vn+1) counts the length of walk from vn 1 to vn+1, hence should be substracted to represent the length of walk from vn to vn+1. To be specific, we have noted that Equation (7) holds for general random walks: t(v0, v N) = n=0 E[Tvn+1 Tvn|x0 = v0] . Next, when the BBRW random walk reaches vn at Tvn, it has passed node vn 1, i.e., x Tvn 1 = vn 1. Therefore, the random walk after the time t = Tvn 1 is the random walk starting at vn 1 with the second node being x1 = vn. Therefore, E[Tvn+1 Tvn|x0 = v0] = t(vn 1 vn, vn+1) 1 for n 1 . B.3.2 Expressing Transition-conditioned Access Time Using Subgraph-return Time Next, we formulate the transition-conditioned access time t(vn 1 vn, vn+1). Published in Transactions on Machine Learning Research (09/2024) n Hfn Y9Fac PKZY/g D5/MHaye Pkg= i COtx BA1r AOEZXu HNe XRen Hfn Y9Fac PKZY/g D5/MH1le M9w=l (a) Figure of nodes i, j, k, l in B.3.2 A+6g CS0gw OEZXu HNe XRen Hfn Y9Fac PKZY/g D5/MHaye Pkg= th I2osz Yb Eo2BG/5VXSrl W9y+p Fs1ap3+Rx FOETu Ec PLi COtx BA1r AOEZXu HNe XRen Hfn Y9Fac PKZY/g D5/MH0cu M9A=i MXFil T8JY2ZKGLNTf E1Ma T2JAts ZUTPUq95c/M/rp Ca89adc Jql By Za Lwl QE5P516TPFTIj Jp ZQpri9lb Ah VZQZm03Bhu Ctvrx Omp Wyd12+qld K1bsjycw Tlcgc3UIUHq EDGCA8wyu8OSPnx Xl3Ppat OSeb OYU/c D5/ANPj PU=j n ACp3AOHlx BA+6g CS0gw OEZXu HNe XRen Hfn Y9Fac PKZY/g D5/MHaye Pkg= (b) Subtree Gi produced by deleting edge (i, j) and choosing connected components to node i Figure 5: References for section B.3.2 Lemma 5. Given a tree G = (V, E) and three nodes i, j, and k such that (i, j), (j, k) E, t(i j, k) 1 = t(j, k) di 1 dj t(i; Gi) 2 where t(i; H) is return time of node i in a subgraph H, and Gi is the subtree produced by deleting edge (i, j) and choosing connected components of i. Proof. We start with establishing basic equalities for access time of BBRW. The first equality describes how the access time of a random walk from i j to k can be expressed by its subset j l, k where l = i due to the non-backtracking property, t(i j, k) 1 = 1 dj 1 + X l N(j)\{i,k} 1 dj 1 t(j l, k) . (10) The next equality describes how the access time from j to k can be decomposed by considering subsequent transitions to some l in the neighborhood of j. t(j, k) = 1 l N(j) t(j l, k) . (11) When plugging in Equation (11) into Equation (10), one can see that we need to consider l = i, k for t(j l, k). In the case of k, it is trivially 1. For the case of i, i.e., t(j i, k), it can be defined as follow: t(j i, k) = 1 + (di 1) t(i; Gi) + t(i j, k) . (12) This is similar to Equation (8), where Gi describes how the walk from j i to k can be divided into two scenarios: (i) continues the walk in Gi or (ii) i transitions into j with probability 1/(di 1). Starting from Equation (10), one can derive the following relationship: t(i j, k) 1 = 1 dj 1 1 + X l N(j)\{i,k} 1 dj 1 t(j l, k) (a) = 1 dj 1 + 1 dj 1 l N(j) t(j l, k) t(j i, k) t(j k, k) (b) = 1 dj 1 + 1 dj 1 dj t(j, k) t(j i, k) 1 (c) = 1 dj 1 dj t(j, k) (di 1) t(i; Gi) t(i j, k) 1 Published in Transactions on Machine Learning Research (09/2024) where (a) is from Equation (10), (b) is from Equation (11) and t(j k, k) = 1, (c) is from Equation (12). Now, by multiplying dj 1 on both sides, we get t(i j, k) 1 = dj t(j, k) (di 1) t(i; Gi) t(i j, k) 1 = dj t(j, k) (di 1) t(i; Gi) t(i j, k) 1 Thus, we can conclude for t(i j, k) 1: t(i j, k) 1 = t(j, k) di 1 dj t(i; Gi) 2 B.3.3 Return Time with respect to a Subgraph Now, we formulate the return time t(i; G) for a tree-graph G. Considering the return time, we prove that the return time of BBRW is less than or equal to that of SRW. Lemma 6. Given a tree G = (V, E) and a node i, the return time of i for a BBRW is the following: t(i; G) = 2|E| Proof. Consider the tree as a rooted tree where the root is i. In the following, we will use mathematical induction based on the tree height of i. First, consider the base case where the height of the tree is 1. Then, whatever we choose as the next node x1 from x0 = i, we return to i at the second transition (i.e., x2 = i) since all the neighbors of i are leaf node. Since di = |E| for tree with height 1, t(i; G) = 2 = 2|E| Now, assume that the lemma holds for the tree with a height less than k 1. It suffices to show that the lemma also holds for a tree with height k + 1 and its root i. From the same perspective in the proofs of the Lemma 3, we can view the random walk returning to i as follows: 1. We choose a node x1 = l from N(i) uniformly. Then, from node l, we return to l to reach i. (i.e., we have to pay penalty amounts to return time of l) 2. In node l, we try to reach i by selecting edge (l, i) with probability 1 dl 1. Thus, the average number of failures is dl 2. 3. If we fail to reach i from l, we should return to l for a next chance to reach i. i.e., an average failure penalty amounts to a return time of l in the subtree with root l. t(i; G) = 1 + X 1 + t(l; Gl) 1 di t(l; Gl) (dl 1) , where t(l; Gl) is the return time from node l for graph G and Gl denotes the subtree with root l, respectively. Published in Transactions on Machine Learning Research (09/2024) Since the subtree with root l has height less than k, t(l; Gl) = 2|E(Gl)| dl 1 . Therefore, t(i; G) = 2 + 1 l N(i) t(l; Gl) (dl 1) dl 1 (dl 1) l N(i) 2|E(Gl)| l N(i) |E(Gl)| B.3.4 Access Time between Neighbors For access time between neighbors, we achieve a similar result to Lemma 3. Lemma 7. Given a tree G and adjacent nodes i, j, t(i, j) = 1 + 2|E(Gi)| di 1 where E(G) is edge set of graph G, and Gi is the subtree produced by deleting edge (i, j) and choosing connected component of i. Proof. We follow the same perspective of the proofs in Lemma 3. We first analyze the success probability for each trial, which is 1 di since there is no previous node. After the first trial, the success probability is fixed to 1 di 1. On each trial, the average failure penalty amounts to t(i; Gi), which is the return time of i on subtree Gi. Thus, the average number of trials until first success is, (Average number of trial) = 1 di 1 + di 1 The average number of failures is as follows: (Average number of failure) = (Average number of trial) 1 di 1 + di 1 di (1 + di 1) 1 Thus, the expected total penalty is as follows: (Expected total penalty) = t(i; Gi) (di 1)2 Since t(i; Gi) = 2|E(Gi)| di 1 by Lemma 6, t(u, v) = 1 + 2|E(Gi)| di 1 Published in Transactions on Machine Learning Research (09/2024) B.3.5 Main Result Finally, we show that the access time of BBRW between two nodes i and j, i.e., t(i, j), in Proposition 7. Proposition 7. Given a tree G and a pair of nodes i, j, the following equations hold for the access time of BBRW between i and j. 1 + 2|E(Gn)| dvn 1 where E(G) is the edge set of G and Gn is the subtree produced by deleting edge (vn, vn+1) and choosing connected component of vn. Proof. By Lemma 4 and Equation (9) of Lemma 5, t(v0, v N) = t(v0, v1) + t(vn 1 vn, vn+1) 1 = t(v0, v1) + t(vn, vn+1) dvn 1 1 dvn t(vn 1) 2 dvn n=0 t(vn, vn+1) dvn t(vn 1) 1 + 2|E(Gl)| dvn 1 Published in Transactions on Machine Learning Research (09/2024) B.4 Proof of Proposition 1 Finally, we compare the access time of two random walks in a tree. Recall Proposition 1. Proposition 1. Given a tree G = (V, E) and a pair of nodes i, j V, the access time of begrudgingly backtracking random walk is equal to or smaller than that of a simple random walk. The equality holds if and only if the walk length is 1. Proof. Let E(G) be the edge set of G and Gn be the subtree produced by deleting edge (vn, vn+1) and choosing connected component of vn. Then, t(v0, v N) t(v0, v N) = 1 + 2|E(Gn)| dvn 1 n=0 (1 + 2|E(Gn)|) Therefore the access time of two nodes is always less or equal in begrudgingly backtracking random walks than simple random walks, where equality holds for random walks with length 1. Published in Transactions on Machine Learning Research (09/2024) C Proofs for Section 4.1 C.1 Preliminaries Let G be a graph with a set of n vertices V, a set of m edges E V2. We use xi to denote the node-wise feature for i V, and di for the degree of node i V. The adjacency matrix A Rn n encodes the connectivity of graph G. For node i V, we define the set of incident outgoing edges of i as N + e (i), the set of incident incoming edges of i as N e (i), and let Ne(i) = N + e (i) N e (i) be the set of all incident edges of node i. Also, recall the non-backtracking matrix B {0, 1}2|E| 2|E| and the incidence matrix C R2|E| |V|: B(ℓ k),(j i) = ( 1 if k = j, ℓ = i 0 otherwise , C(k j),i = ( 1 if j = i or k = i 0 otherwise . We also define Dout and Din as the out-degree and in-degree matrices of NBA-GNN, respectively, counting the number of outgoing and incoming edges for each edge. These are diagonal matrices with (Dout)(j i),(j i) = P ℓ k B(j i),(ℓ k) and (Din)(j i),(j i) = P ℓ k B(ℓ k),(j i), for each index j i. Next, we introduce b B as the normalized non-backtracking matrix augmented with self-loops: b B = (Dout + I) 1 2 (B + I)(Din + I) 1 2 . Then one obtains the following sensitivity bound of NBA-GNN. We also let C denote a matrix where C(k j),i = C(k j),i + C(j k),i. Consequently, one can express our NBA-GNN updates: h(t+1) j i = ϕ(t) h(t) j i, X (k,ℓ) E b B(ℓ k),(j i)ψ(t) h(t) ℓ k, h(t) j i where ϕ(t) and ψ(t) corresponds to the update and the aggregation as described in Equation (3). Next, the node-wise aggregation step can be described as follows: (j,k) E C(k j),iρ h(T ) k j , X (j,k) E C(j k),iρ h(T ) j k C.2 Proof of Lemma 1 The original sensitivity bound from Topping et al. (2022) for a hidden representation of node j and an initial feature of node i is defined as the following: Proposition 2 (Sensitivity bounds). (Topping et al., 2022) Assume an MPNN defined in Equation (1). Let two nodes i, j V with distance T. If ϕ(t) 1 α and ψ(t) 1 β for 0 t < T, then the sensitivity bound can be defined as the following: h(T ) j xi 1 (αβ)T ( b AT )j,i , (6) where b A denotes the degree-normalized adjacency matrix. Now, we show that the sensitivity bound for non-backtracking GNNs can be defined as following. Following Topping et al. (2022), we assume the node features and hidden representations are scalar for better understanding. Lemma 1 (Sensitivity bounds of NBA-GNNs). Consider two nodes i, j V with a random walk distance T given a (T 1)-layer NBA-GNN as described in Equation (3) and Equation (4). Suppose ϕ(t) 1 , σ 1 α, ψ(t) 1 , ρ 1 β, and ϕ 1 γ for 0 t < T. Then the following holds: hj xi 1 (αβ)T γ( C b B(T 1) C)j,i . Published in Transactions on Machine Learning Research (09/2024) Proof. Just a straight calculation using the chain rule is enough to derive the above upper bound. k ℓ N e (j) h(T 1) k ℓ xi k ℓ N + e (j) h(T 1) k ℓ xi k ℓ N e (j) h(T 1) k ℓ xi k ℓ N e (j) h(T 1) k ℓ xi where the bound for the derivatives were σ 1 α, ρ 1 β. Thus considering the message update from Equation (13), h(T 1) k ℓ xi = 1ϕ(T 2) h(T 2) k ℓ xi + 2ϕ(T 2) k ℓ N e (k) b Bk ℓ ,k ℓ h(T 2) k ℓ xi k ℓ N e (k) b Bk ℓ ,k ℓ h(T 2) k ℓ xi since the distance between node i and node j is at least T, therefore h(T 2) k ℓ xi = 0. Now, let the path from node i to node j as s(i, j), where st denotes the t-th node in the walk s, i.e., s0 = i, s T = j. Using the bound of the derivative of functions, we can iterate like the following. h(T 1) k ℓ xi k ℓ N e (k) b Bk ℓ ,k l h(T 2) k l xi (s0,...,s T ) s(i,j) b Bs0 s1,s1 s2 b Bs T 2 s T 1,s T 1 s T h(0) s0 s1 xi = (αβ)T 1 X (s0,...,s T ) s(i,j) t=1 b Bst 1 st,st st+1 h(0) s0 s1 xi (s0,...,s T ) s(i,j) t=1 b Bst 1 st,st st+1 Substitute the inequality Equation (15) into Equation (14) to get the final result. Then, we can get (s0,...,s T ) s(i,j) t=1 b Bst 1 st,st st+1 = (αβ)T γ( C b BT 1 C)j,i , since paths can be expressed using the power of adjacency-type matrix. Published in Transactions on Machine Learning Research (09/2024) C.3 Proof of Proposition 3 We have defined the sensitivity bound for NBA-GNNs. Now, we show that the sensitivity bound of NBAGNNs are bigger than the sensitivity bound of GNNs. Proposition 3. Consider an MPNN defined as in Equation (1) and a (T 1)-layer NBA-GNN described by Equation (3) and Equation (4). For any pair of nodes i, j V with distance T, the following inequality holds between sensitivity bounds: ( b AT )j,i ( C b BT 1 C)j,i . For d-regular graphs, ( C b BT 1 C)j,i decays slower by O(d T ), while ( b AT )j,i decays with O((d + 1) T ). Proof. Identical to the notations above, we denote the list of nodes from node i to node j as path s(i, j) where s(i, j) = (s0 = i, s1, , s T 1, s T = j). ( b AT )j,i = X (s0,...,s T ) s(i,j) b As0,s1 b As T 1,s T (s0,...,s T ) s(i,j) (di + 1) 1 2 (dj + 1) 1 t=1 (dst + 1) 1 . ( C b BT 1 C)j,i = X (s0,...,s T ) s(i,j) d 1 2 s0 b Bs0 s1,s1 s2 b Bs T 2 s T 1,s T 1 s T d 1 (s0,...,s T ) s(i,j) d 1 2 s0 d 1 s1 d 1 s T 1 d 1 (s0,...,s T ) s(i,j) d 1 t=1 d 1 st . Now, for a path (s0 = i, . . . , s T = j), 2 (dj + 1) 1 t=1 (dst + 1) 1 d 1 Each term in ( C b BT 1 C)j,i is larger than ( b AT )j,i, thus ( C b BT 1 C)j,i ( b AT )j,i is trivial. For d-regular graphs, di = d, i V. Therefore the sensitivity bound can can be written as following: ( b AT )j,i = (d + 1) T , ( C b BT 1 C)j,i = d T . So ( C b BT 1 C)j,i decays slower by O(d T ), while ( b AT )j,i decays with O((d + 1) T ). Published in Transactions on Machine Learning Research (09/2024) D Proofs for Section 4.2 To assess the expressive capabilities, we initially make an assumption about the graphs, considering that it is generated from the Stochastic Block Model (SBM), which is defined as follows: Definition 1. Stochastic Block Model (SBM) is generated using parameters (n, K, α, P), where n denotes the number of vertices, K is the number of communities, α = (α1, ..., αK) represents the probability of each vertex being assigned to communities V1, ..., VK, and Pij denotes the probability of an edge (v, w) E between v Vi and w Vj. Numerous studies have focused on the problem of achieving exact recovery of communities within the SBM. However, these investigations typically address scenarios in which the average degree is at least on the order of Ω(log n) (Abbe, 2017). It is well-established that the information-theoretic limit in such cases can be reached through the utilization of the spectral method, as demonstrated by Yun & Proutière (2016). In contrast, when dealing with a graph characterized by the average degree much smaller, specifically o (log n), recovery using the graph spectrum becomes a more challenging endeavor. This difficulty arises due to the fact that the n1 o(1) largest eigenvalues and their corresponding eigenvectors are influenced by the high-degree vertices, as discussed in Benaych-Georges et al. (2019). However, real-world benchmark datasets often fall within the category of very sparse graphs. For example, the citation network dataset discussed in Sen et al. (2008) has an average degree of less than three. In such scenarios, relying solely on an adjacency matrix may not be an efficient approach for uncovering the hidden graph structure. Fortunately, an alternative strategy is available by utilizing a non-backtracking matrix. D.1 Proof of Proposition 4 Let s rephrase the formal version of Proposition 4 as follows: Proposition 4. Consider a Stochastic Block Model (SBM) with parameters n, 2, 1 n , where (a + b) satisfies the conditions of being at least ω(1) and no(1). In such a scenario, the non-backtracking graph neural network can accurately map from graph G to node labels, with the probability of error decreasing to 0 as n . In Stephan & Massoulié (2022), the authors demonstrate that the non-backtracking matrix exhibits valuable properties when the average degree of vertices in the graph satisfies ω(1) and no(1). When K = 2, we define a function σ(v) for v V, such that σ(v) = 1 if v is in the first class, and σ(v) = 1 in the second class. Then, they establish the following proposition: Proposition 8. (Stephan & Massoulié, 2022) Suppose we have a SBM with parameters n, 2, 1 n . In this case, we have two eigenvalues µ1 > µ2 of ndiag(α)P, and the eigenvector ϕ2 corresponding to µ2, where v-th component is set to σ(v). Then, for any n larger than an absolute constant, the eigenvalues λ1 and λ2 of the non-backtracking matrix B satisfies |λi µi| = o(1), and all other eigenvalues of B are confined in a circle with radius (1 + o(1)) q 2 . Also, there exists an eigenvector ν2 of the non-backtracking matrix B associated with λ2 such that 1 8 (a + b)(a b)2 + o(1) := 1 f(a, b) where ξ2 = T Θϕ2 T Θϕ2 , T {0, 1}2m n, Tei = 1{e2 = i} and Θ {0, 1}n 2, Θij = 1 if the vertex i is in the j-th community, and 0 otherwise. The proposition above highlights that the non-backtracking matrix possesses a spectral separation property, even in the case of very sparse graphs. Moreover, the existence of an eigenvector ν2 such that ν2, ξ2 = 1 o(1) suggests that this eigenvector contains information about the community index of vertices. The foundation for these advantageous properties of the non-backtracking matrix B in sparse scenarios can be attributed to the observation that the matrix Bk shares similarities with the k-hop adjacency matrix, while Ak is predominantly influenced by high-degree vertices. Consequently, we can establish the following lemma: Published in Transactions on Machine Learning Research (09/2024) Lemma 8. Suppose we have a SBM with parameters defined in Proposition 8. Then, there exists a function of the eigenvectors of the non-backtracking matrix that can accurately recover the original community index of vertices. The proof for Lemma 8 can be found in Appendix D.3 In the following, we will demonstrate that the nonbacktracking GNN can compute an approximation to the top K eigenvectors mentioned in Lemma 8. To achieve this, we first require the following lemma: Lemma 9. Assuming that the non-backtracking matrix B has a set of orthonormal eigenvectors νi with corresponding eigenvalues λ1 > ... > λK λK+1 ... λ2m, then there exists a sequence of convolutional layers in the non-backtracking GNNs capable of computing the eigenvectors of the non-backtracking matrix. For the proof of Lemma 9, please refer to Appendix D.4. With this lemma in mind, we can observe that a sequence of convolutional layers, followed by a multilayer perceptron, can approximate the function outlined in Lemma 8, leveraging the universal approximation theorem. This argument leads to Proposition 4. D.2 Proof of Proposition 5 Let s rephrase the formal version of Proposition 5 as follows: Proposition 5. Consider two graphs, one generated from a SBM (G) with parameters n, 2, 1 and the other from an Erdős Rényi model (H) with parameters (n, c n), for some constants a, b, c > 1. When (a b)2 > 2(a + b), the non-backtracking graph neural network is capable of distinguishing between G and H with probability tending to 1 as n . To establish Proposition 5, we rely on the following Proposition 9 from Bordenave et al. (2015): Proposition 9. (Bordenave et al., 2015) Suppose we have two graphs G and H as defined in the formal statement of Theorem 5. Then, the eigenvalues λi(B) of the non-backtracking matrix satisfy the following: λ1(BG) = a + b 2 + o(1), λ2(BG) = a b 2 + o(1), and |λk(BG)| 2 + o(1) for k > 2 λ1(BH) = c + o(1) and |λ2(BH)| c + o(1) Proposition 9 informs us that by examining the distribution of eigenvalues, we can discern whether a graph originates from the Erdős Rényi model or the SBM. Leveraging Lemma 9, we can obtain the top two normalized eigenvectors of the non-backtracking matrix using convolutional layers, denoted as ν1 and ν2. Applying the non-backtracking convolutional layer to these vectors yields resulting vectors with ℓ2-norms corresponding to λi(B). Consequently, we can distinguish between the two graphs, G and H, by examining the output of the convolutional layer in the non-backtracking GNN. D.3 Proof of Lemma 8 Proof. Let us revisit the matrix T, defined as Tei = 1{e2 = i}, and its pseudo-inverse denoted as T + = D 1T , where D is a diagonal matrix containing the degrees of vertices on the diagonal. Considering the definition of ξ2 as provided in Proposition 8, we can deduce the label of vertex v. Specifically, if (T +ξ2)v > 0, it implies that the vertex belongs to the first class; otherwise, it belongs to the other class. Additionally, we are aware that (T +ν2 T +ξ2)i 2 = O(f(a, b)) as indicated in Proposition 8, considering the property that the sum of each row of T + is equal to 1. Consequently, by examining the signs of elements in the vector T +ν2, we can classify nodes without encountering any misclassified ones. Published in Transactions on Machine Learning Research (09/2024) D.4 Proof of Lemma 9 Proof. Suppose we have f arbitrary vectors x1, ..., xf and a matrix X = [x1, ..., xf] R2m f, which has x1, ..., xf as columns. Without loss of generality, we assume that f 2m and xv 2 = 1. Let xj = P2m i=1 c(j) i νi for 1 j f. We will prove the lemma by showing that if we multiply X by B and repeatedly apply the Gram Schmidt orthnormalization to the columns of the resulting matrix, the j-th column of the resulting matrix converges towards the direction of νj. Further, we will show that there exists a series of convolutional layers equivalent to this process. First, we need to show Bkx1 converges to the direction of ν1. Bkx1 can be expressed as: Bkx1 = λk 1 c(1) 1 ν1 + c(1) 2 k ν2 + ... + +c(1) 2m Let s fix ϵ > 0 and take K0 such that for all k K0, the inequality λ2 λ1 k < ϵ|c(1) 1 | 4m holds. Furthermore, we define the following for brevity: Akx1 Akx1 2 for ℓ= 1, u(k) ℓ u(k) ℓ 2 where u(k) ℓ = GS(Be(k 1) ℓ ) for ℓ 2 . Then, the distance between e(k) 1 and sign c(1) 1 ν1 is e(k) 1 sign c(1) 1 ν1 2 c(1) 1 ν1 + P2m i=2 λi λ1 k c(1) i νi D sign c(1) 1 ν1 c(1) 1 ν1 sign c(1) 1 Dν1 = |c(1) 1 | + D q P2m i=2 λi λ1 2k (c(1) i )2 where (a) stems from defining (c(1) 1 )2 + P2m i=2 λi λ1 2k (c(1) i )2 as D, (b) and (d) are obtained from the assumption λ2 λ1 k < ϵ|c(1) 1 | 4m , and for (c) we use the inequality p x2 + y2 |x| + |y|. Moreover, we assume that for all 1 ℓ< L and any ϵℓ> 0, there exists Kℓsuch that for all k Kℓ, e(k) ℓ sign c(ℓ) ℓ νℓ 2 < ϵℓ. To simplify the notations, we define K0 = max1 ℓ 0 and ϵℓ= min(1, ϵ0) Published in Transactions on Machine Learning Research (09/2024) for 1 ℓ< L. Then, u(k) L for k K0 can be written as u(k+1) L = Be(k) L X i 0, u(k+p) L is u(k+p) L = X i L νi, Be(k+p 1) L νi y(k+p) L = 1 u(k+p 1) L 2 j L νj, Be(k+p 2) L λjνj By(k+p 2) L νi y(k+p) L = 1 u(k+p 1) L 2 λi νi, Be(k+p 2) L νi, By(k+p 2) L νi y(k+p) L = 1 u(k+p 1) L 2 i L νi, Be(k+p 2) L λiνi 1 u(k+p 1) L 2 i L νi, By(k+p 2) L νi y(k+p) L = 1 Qp 1 j=1 u(k+j) L 2 i L νi, Be(k) L λp 1 i νi 1 u(k+j) L 2 i L νi, By(k+j 1) L λp j 1 i νi = 1 Qp 1 j=1 u(k+j) L 2 i L νi, Be(k) L λp 1 i νi 1 u(k+j) L 2 i L νi, By(k+j 1) L λp j 1 i νi (a) = C0λp 1 L νL, Be(k) L νL + X i>L νi, Be(k) L λi j=1 Cj νi, By(k+j 1) L λp j 1 i νi y(k+p) L , Published in Transactions on Machine Learning Research (09/2024) where (a) stems from the definitions C0 = 1 Qp 1 j=1 u(k+j) L 2 and Cj = 1 u(k+j) L 2 . Now, define ˆe(k+p) L := u(k+p) L C0λp 1 L νL,Be(k) L . Then, ˆe(k+p) L = νL + X νi, Ae(k) L νL, Be(k) L i L Pp 1 j=1 Cj νi, By(k+j 1) L λp j 1 i νi y(k+p) L C0λp 1 L νL, Be(k) L . (17) Take p0 such that λi λL p0 1 ϵ νL,Be(k) L 4(L N) νi,Be(k) L , then, the ℓ2-norm of the second term of Equation (17) is less than ϵL/4. Meanwhile, the ℓ2-norm of the third term in Equation (17) is upper-bounded by C0λp 1 L νL, Be(k) L λp 2 i 1 λi 1 + 1 where C := maxp 1 j=1 Cj. Therefore, if we take ϵ0 < ϵC0λp 1 L νL,Be(k) L 12Ldmax i L λp 2 i 1 λi 1 + 1 1 , we can get ˆe(k+p) L νL 2 < ϵL/2 for p > p0. Finally, the upper bound of z(k) L 2 is z(k) L 2 e(k+p) L ˆe(k+p) L 2 + ˆe(k+p) L νL 2 (a) < ˆe(k+p) L 2 1 + ϵL where (a) follows from e(k+p) L = ˆe(k+p) L ˆe(k+p) L 2 , and (b) is obtained from the inequality 1 ϵL 2 ˆe(k+p) L 2 Published in Transactions on Machine Learning Research (09/2024) E Experiment Details In this section, we provide details of NBA-GNN implementation and experiments. E.1 Implementation E.1.1 Message Initialization and Aggregation For message initialization and final aggregation of messages, we have proposed functions ϕ, σ, ρ. To be specific, the message initialization can be written as follows: ( ϕ(eij, xi, xj) (if eij exists) ϕ(xi, xj) (otherwise) . For our experiments, we use concatenation for ϕ, weighted sums for σ, and average for ρ. E.1.2 Non-backtracking Updates Here, we provide the details of using the non-backtracking operator in our NBA-GNNS. To begin, let s revisit the non-backtracking matrix B {0, 1}2|E| 2|E| from Section 4.1 defined as follows: B(ℓ k),(j i) = ( 1 if k = j, ℓ = i 0 otherwise . Returning to our NBA-GNN, the non-backtracking message passing update for a hidden feature h(t) j i at the (t + 1)-th layer, introduced in Section 3.2, is expressed as follows: h(t+1) j i = ϕ(t) h(t) j i, n ψ(t) h(t) k j, h(t) j i : k N(j) \ {i} o , (3) where ϕ(t) and ψ(t) are backbone-specific non-linear update and permutation-invariant aggregation functions at the t-th layer, respectively. Using the non-backtracking matrix B, we can rewrite Equation (3) as following: H(t+1) = ϕ(t) H(t), ψ(t) B H(t), H(t) , where H(t) R2|E| d are edge-wise features, i.e., each row representing h(t) j i. E.1.3 Implementation Example Now, the NBA-GNN implementation example in Section 3 using GCN (Kipf & Welling, 2017) as a backbone, can be written as the following: h(t+1) j i = h(t) j i + 1 |N(j)| 1W(t) X k N(j)\{i} h(t) k j . Recall the out-degree and in-degree matrices of NBA-GNN, denoted as Dout and Din, where (Dout)(j i),(j i) = P ℓ k B(j i),(ℓ k) and (Din)(j i),(j i) = P ℓ k B(ℓ k),(j i), for each index j i. We also introduced the normalized non-backtracking matrix augmented with self-loops as b B = (Dout + I) 1 2 (B + I)(Din + I) 1 2 . In summary, Equation (5) can be expressed as follows: H(t+1) = b BH(t)W(t) . Hence, the message passing update in NBA-GCN is equivalent to the multiplication of non-backtracking operator and edge-wise representations constructed from node-wise features. Published in Transactions on Machine Learning Research (09/2024) E.2 Long-Range Graph Benchmark E.2.1 Dataset Statistics From LRGB (Dwivedi et al., 2022), we experiment for 3 tasks: graph classification (Peptides-func), graph regression (Peptides-struct), and node classification (Pascal VOC-SP). We provide the dataset statistics in Table 10. Note that for Pascal VOC-SP, we use SLIC compactness of 30, and edge weights are based only on super-pixels coordinates following the recent work (Gutteridge et al., 2023). Table 10: Statistics of datasets in LRGB. Dataset Total Total Avg Mean Total Avg Avg Avg Dataset Graphs Nodes Nodes Deg. Edges Edges Short. Path. Diameter Splits Pascal VOC-SP 11,355 5,443,545 479.40 8.00 43,548,360 3,835.17 8.05 0.18 19.40 0.65 75/12.5/12.5 Peptides-func 15,535 2,344,859 150.94 2.04 4,773,974 307.30 20.89 9.79 56.99 28.72 70/15/15 Peptides-struct 15,535 2,344,859 150.94 2.04 4,773,974 307.30 20.89 9.79 56.99 28.72 70/15/15 E.2.2 Experiments Details All experiment results are averaged over three runs (seed 0 2) and trained for 300 epochs, with a 500k parameter budget. Baseline scores were taken from the LRGB (Dwivedi et al., 2022), Table 1 of DRew (Gutteridge et al., 2023), and papers of each work. We use an Adam W optimizer (Loshchilov & Hutter, 2018) with lr decay=0.1 , min lr=1e-5, momentum=0.9, and base learning rate lr=0.001 (0.0005 for Pascal VOC-SP). We use cosine scheduler with reduce factor=0.5 , schedule patience=10 with 50 warm-up. Laplacian positional encoding was used with hidden dimension 16 and 2 layers. We use the Atom Encoder , Bond Encoder for Peptides-func, Peptides-struct from based on OGB molecular feature (Hu et al., 2020; 2021), and the VOCNode Encoder , VOCEdge Encoder for Pascal VOC-SP. Pascal VOC-SP results in Figure 4a all use the same setup described above. Peptides-func results in Figures 4b and 4c all use the same setup described above. GCN results in Figures 4a to 4c use the hyperparameters from Tönshoff et al. (2023). We searched the following range of hyperparameters, and reported the best in Table 11. We searched layers 6 to 12 for Pascal VOC-SP 2, layers 5 to 20 for Peptides-func and Peptides-struct. The hidden dimension was chosen by the maximum number in the parameter budget. Dropout was searched from 0.0 0.8 for Pascal VOC-SP in steps of 0.1, and 0.1 0.4 in steps of 0.1 for Peptides-func and Peptides-struct. We used the batch size of 30 for Pascal VOC-SP on GPU memory, and 200 for Peptides-func and Peptides-struct. Published in Transactions on Machine Learning Research (09/2024) Table 11: Best hyperparameters for each NBA-GNN and dataset in LRGB. Model Dataset # Param. # Layers hidden dim. dropout Batch size # epochs NBA-GCN Pascal VOC-SP 472k 12 180 0.7 30 200 Peptides-func 510k 10 186 0.1 200 300 Peptides-struct 505k 20 144 0.1 200 300 NBA-GIN Pascal VOC-SP 472k 12 180 0.7 30 200 Peptides-func 502k 10 144 0.1 200 300 Peptides-struct 503k 10 144 0.1 200 300 NBA-Gated GCN Pascal VOC-SP 486k 10 96 0.25 30 200 Peptides-func 511k 10 96 0.1 200 300 Peptides-struct 511k 8 108 0.1 200 300 E.3 Transductive Node Classification E.3.1 Dataset statistics We conducted experiments involving three citation networks (Cora, Cite Seer, and Pubmed in Sen et al. (2008)), and three heterophilic datasets (Texas, Wisconsin, and Cornell in Pei et al. (2019)), focusing on transductive node classification. Our reported results in Table 3 are the averages obtained from 10 different seed runs to ensure robustness and reliability. For the citation networks, we employed the dataset splitting procedure outlined in Yang et al. (2016). In contrast, for the heterophilic datasets, we randomly divided the nodes of each class into training (60%), validation (20%), and testing (20%) sets. We provide more details of dataset statistics in Table 12. E.3.2 Experiment Details The training duration spanned 1,000 epochs for citation networks and 100 epochs for heterophilic datasets. Following training, we selected the best epoch based on validation accuracy for evaluation on the test dataset. We used the Adam W optimizer (Loshchilov & Hutter, 2018) with a learning rate of 3e-5. The model s hidden dimension and dropout ratio were set to 512 and 0.2, respectively, consistent across all datasets, after fine-tuning these hyperparameters on the Cora dataset. Additionally, we conducted optimization for the number of convolutional layers within the set {1, 2, 3, 4, 5}. The results revealed that the optimal number of layers is typically three for most of the models and datasets. However, there are exceptions, such as Cite Seer-Gated GCN, Pub Med-{Graph SAGE, Graph SAGE+NBA+PE}, Wisconsin-{Graph SAGE+NBA, Graph SAGE+NBA+PE} and Cornell-{Graph SAGE, Graph SAGE+NBA, Graph SAGE+NBA+PE}, where the optimal number of layers is found to be four. Furthermore, for Cora-{Graph SAGE+NBA+PE, GAT+NBA}, Cite Seer-Graph SAGE+NBA, the optimal number of layers is determined to be five. Table 12: Statistics of the datasets for the transductive node classification task Dataset Total Num Num Dim Num Graphs Nodes Edges Features Classes Cora 1 2,708 10,556 1,433 7 Cite Seer 1 3,327 9,104 3,703 6 Pub Med 1 19,717 88,648 500 3 Texas 1 183 309 1,703 5 Wisconsin 1 251 499 1,703 5 Cornell 1 183 295 1,703 5 Published in Transactions on Machine Learning Research (09/2024) E.3.3 Baseline implementation In Section 5.3, we compared NBA-GNNs with several baselines to verify the effectiveness of non-backtracking updates. Three baselines that update edge features were considered: Gated GCN (Bresson & Laurent, 2018), EGNN (Gong & Cheng, 2019), and Cens Net (Jiang et al., 2020). Though these baselines update edge features, the six datasets used for transductive node classification do not have initial edge features. Based on each paper and its code, we used the pairwise cosine similarities between corresponding node features as edge features for Cens Net, and encoded each directed edge into a vector v {0, 1}3 for EGNN. Additionally, we utilize the attention-based EGNN model, referred to as EGNN(A) in the original paper.