# graph_transformers_dream_of_electric_flow__f5edf409.pdf Published as a conference paper at ICLR 2025 GRAPH TRANSFORMERS DREAM OF ELECTRIC FLOW Xiang Cheng1, Lawrence Carin1, Suvrit Sra2 1Department of Electrical and Computer Engineering, Duke University, USA 2School of Computation, Information and Technology, Technical University of Munich, Germany {xiang.cheng, lcarin}@duke.edu, s.sra@tum.de We show theoretically and empirically that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems such as electric flow and eigenvector decomposition. The Transformer has access to information on the input graph only via the graph s incidence matrix. We present explicit weight configurations for implementing each algorithm, and we bound the constructed Transformers errors by the errors of the underlying algorithms. Our theoretical findings are corroborated by experiments on synthetic data. Additionally, on a real-world molecular regression task, we observe that the linear Transformer is capable of learning a more effective positional encoding than the default one based on Laplacian eigenvectors. Our work is an initial step towards elucidating the inner-workings of the Transformer for graph data. Code is available at https://github.com/chengxiang/Linear Graph Transformer 1 INTRODUCTION The Transformer architecture (Vaswani et al., 2017) has seen great success in the design of graph neural networks (GNNs) (Dwivedi & Bresson, 2020; Ying et al., 2021; Kim et al., 2022; Müller et al., 2023), and it has demonstrated impressive performance in many applications ranging from prediction of chemical properties to analyzing social networks (Yun et al., 2019; Dwivedi et al., 2023; Rong et al., 2020). In contrast to this empirical success, the underlying reasons for why Transformers adapt well to graph problems are less understood. Several works study the expressivity of attention-based architectures (Kreuzer et al., 2021; Kim et al., 2021; 2022; Ma et al., 2023), but such analyses often rely on the ability of neural networks to approximate arbitrary functions, and may require a prohibitive number of parameters. This paper is motivated by the need to understand, at a mechanistic level, how the Transformer processes graph-structured data. Specifically, we study the ability of a linear Transformer to solve certain classes of graph problems. The linear Transformer is similar to the standard Transformer, but softmax-based activation is replaced with linear attention, and MLP layers are replaced by linear layers. Notably, the linear Transformer contains no a priori knowledge of the graph structure; all information about the graph is provided via an incidence matrix B. For unweighted graphs, the columns of B are just { 1, 0, 1}-valued indicator vectors that encode whether an edge touches a vertex; no other explicit positional or structural encodings are provided. Even in this minimal setup, we are able to design simple configurations of the weight matrices of a Transformer that enable it to solve fundamental problems such as electric flow and Laplacian eigenvector decomposition. Furthermore, we provide explicit error bounds that scale well with the number of Transformer layers. Several of our constructions rely crucially on the structure of the linear attention module, and may help shed light on the success of attention-based GNNs. We hope that our analysis paves the way to a better understanding of the learning landscape of graph Transformers, such as concrete bounds on their generalization and optimization errors. Besides enhancing our understanding of Transformers, our results are also useful for the practical design of graph Transformers. In Sections 3 and 4, we show that the same linear Transformer architecture is capable of learning a number of popular positional encodings (PE). In Section 6, we provide experimental evidence that the linear Transformer can learn better PEs than hard-coded PEs. Published as a conference paper at ICLR 2025 Lemma Task Transformer Implements: # layers for ε error 1 Electric Flow (L ) Gradient Descent log(1/ε) 4 Electric Flow (L ) Multiplicative Expansion log log(1/ε) 2 Resistive Embedding ( L ) Power Series log(1/ε) 3 Heat Kernel (e s L) Power Series log(1/ε) + sλmax 5 Heat Kernel (e s L) Multiplicative Expansion log(1/ε) + log(sλmax) 6 & 7 k-Eigenvector Decomp. of L Subspace Iteration (subspace-iteration steps)/k Table 1: Summary of Error Bounds proved for various Transformer constructions. L is the graph Laplacian; L denotes its pseudoinverse. 1.1 SUMMARY OF CONTRIBUTIONS Below, we summarize the main contributions of our paper. 1. We provide explicit weight configurations for which the Transformer implements efficient algorithms for several fundamental graph problems. These problems serve as important primitives in various graph learning algorithms, and have also been useful as PEs in state-of-the-art GNNs. (a) Lemma 1 constructs a Transformer that solves electric flows by implementing steps of gradient descent to minimize flow energy; consequently, it can also invert the graph Laplacian. (b) Lemmas 2 and 3 construct Transformers that compute low-dimensional resistive embeddings and heat kernels. Both constructions are based on implementing suitable power series. (c) By implementing a multiplicative polynomial expansion, Lemma 4 provides a construction for electric flow with exponentially higher accuracy than Lemma 1. Similarly, Lemma 5 provides a construction that computes the heat kernel using much fewer layers than Lemma 3. (d) In Lemma 6, we show that the Transformer can implement subspace iteration for finding the top-k (or bottom-k) eigenvectors of the graph Laplacian. Central to this analysis is the ability of self-attention to compute a QR decomposition of the feature vectors. We derive explicit error bounds for the Transformer based on the convergence rates of the underlying algorithm implemented by the Transformer. We summarize these results in Table 1. 2. In Section 5, we propose a more efficient version of the linear Transformer with much fewer parameters. We show that this more efficient linear Transformer can nonetheless implement all the above-mentioned constructions. Further, we show in Lemma 10 that the parameter-efficient linear Transformer has desirable invariance and equivariance properties. 3. We test the empirical performance of our theory on synthetic random graphs. In Section 3.5, we verify that Transformers with a few layers can achieve high accuracy in computing electric flows, resistive embeddings, as well as heat kernels. In Section 4.1, we verify that the Transformer can accurately compute top-k and bottom-k eigenevectors of the graph Laplacian. 4. In Section 6, we demonstrate the advantage of using the linear Transformer as a replacement for Laplacian eigenvector PE on a molecular regression benchmark using the QM9 and ZINC datasets (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014; Irwin et al., 2012). After replacing the Laplacian eigenvector-based PE with the linear Transformer, and training on the regression loss, we verify that the linear Transformer automatically learns a good PE for the downstream regression task that can outperform the original Laplacian eigenvector PE by a wide margin. 1.2 RELATED WORK Numerous authors have proposed different ways of adapting Transformers to graphs (Dwivedi & Bresson, 2020; Ying et al., 2021; Rampášek et al., 2022; Müller et al., 2023). A particularly promising approach is to use a suitable positional encoding to incorporate structural information in the input, examples include Laplacian eigenvectors (Dwivedi & Bresson, 2020; Kreuzer et al., 2021), heat kernel (Choromanski et al., 2022), resistance distance and commute time (Ma et al., 2023; Velingker et al., 2024; Zhang et al., 2023) and shortest path distance (Ying et al., 2021). Lim et al. (2022) designed a neural network to transform eigenvectors into an encoding that has certain invariance properties. Black et al. (2024) compared the expressivity of different PE schemes. (Srinivasan & Ribeiro, 2020) studied the relationship between PE and structural graph representations. Published as a conference paper at ICLR 2025 Kim et al. (2021) explore the possibility of using pure Transformers for graph learning. They provide both nodes and edges as input tokens, with a simple encoding scheme. They also prove that such a Transformer is as expressive as a second-order invariant graph network. A number of works have explored the use of learned PEs (Mialon et al., 2021; Eliasof et al., 2023; Park et al., 2022; Ma et al., 2023; Kreuzer et al., 2021; Dwivedi et al., 2021). In particular Ma et al. (2023) is based on relative random walk probabilities (RRWP), and Kreuzer et al. (2021) s approach is based on learned eigenfunctions. In comparison, using a linear-Transformer to learn PEs has two advantages: 1. self attention can implement operations such as orthogonalization, necessary for learning multiple eigenvectors (Lemma 6). Second, self-attention enables highlyefficient algorithms based on multiplicative polynomial expansions (Lemma 4). These enable linear Transformers to learn good PEs given only the incidence matrix as input. Finally, we note several relevant papers that are unrelated to graph neural networks. A recent body of work proves the ability of Transformers to implement learning algorithms, to explain the in-context learning phenomenon (Schlag et al., 2021; Von Oswald et al., 2023). Surprisingly, our construction for Lemma 1 bears several remarkable parallels to the gradient descent construction by Von Oswald et al. (2023). We conjecture that the proof of Lemma 4 may be applicable for understanding the GD++ algorithm in the same paper. In another direction, Charton (2021) show experimentally that Transformers can compute linear algebra operations such as eigenvector decomposition. Their work requires a relatively complicated matrix encoding and a large number of parameters. 2 PRELIMINARIES AND NOTATION We use G = (V, E) to denote a graph with vertex set V and edge set E; let n := |V| and d := |E|. We often identify the vertex vi with its index i for i = 1...n, and the edge ej with j for j = 1...d. In general, G has weighted edges, defined by r( ) : E R+. We let rj := r(ej). A central object of interest is the incidence matrix B Rn d, defined as follows: to each edge ej = (u v) E, assign an arbitrary orientation, e.g. ej = (u v). The incidence matrix B is given by 1/ rj, if exists v V such that ej = (ui v) +1/ rj, if exists v V such that ej = (v ui) 0, otherwise. (1) We define the graph Laplacian as L := BB Rn n. We use λmax to denote the maximum eigenvalue of L. Note that L always has 0 as its smallest eigenvalue, with corresponding eigenvector 1 n 1, the all-ones vector. This fact can be verified by noticing that each column of B sums to 0. For a connected graph (as we will assume is the case throughout the paper), the second-smallest eigenvalue is always non-zero, and we will denote it as λmin. Finally, we define ˆIn n := In n 1 n 1 1 . ˆIn n is the projection onto span(L), and functions as the identity matrix when working within span(L). 2.2 LINEAR TRANSFORMER We will use Z0 Rh n to denote the input to the Transformer. Z0 encodes a graph G, and each column of Z0 encodes a single vertex in h dimensions. Let W Q, W K, W V Rh h denote the key, query and value parameter matrices. We define linear attention Attn as Attn W V ,W Q,W K(Z) := W V ZZ W Q W KZ. (2) Unlike standard attention, (2) does not contain softmax activation. We construct an L-layer Transformer by stacking L layers of the attention module (augmented with a linear operation on the skip connection). To be precise, let Zl denote the output of the (l 1)th layer of the Transformer. Then Zl+1 := Zl + Attn W V l ,W Q l ,W K l (Zl) + W RZl, (3) where W V l , W Q l , W K l are the value, query and key weight matrices of the linear attention module at layer l, and W R Rh h is the weight matrix of the linear operation. Henceforth, we let W V := {W V l }l=0...L, W Q := {W Q l }l=0...L, W K l := {W K l }l=0...L, W R l := {W R l }l=0...L denote collections of the parameters across all layers of an L-layer Transformer. Finally, we will often need to refer to specific rows of Zl. We will use [Zl]i...j to denote rows i to j of Zl. Published as a conference paper at ICLR 2025 3 TRANSFORMERS AS POWERFUL SOLVERS FOR LAPLACIAN PROBLEMS In this section, we discuss the capacity of the linear Transformer 3 to solve certain classes of canonical graph problems. We begin with the problem of Electric Flow in Section 3.2, where the constructed Transformer can be interpreted as implementing steps of gradient descent with respect to the energy of the induced flow. Subsequently, in Section 3.3, we provide constructions for computing the resistive embedding, as well as the heat kernel, based on implementing a truncated power series. Finally, in Section 3.4, we provide faster alternative constructions for solving electric flow and computing heat kernels, based on implementing a multiplicative polynomial expansion. In each case, we bound the error of the Transformer by the convergence rate of the underlying algorithms. 3.1 ADDITIONAL SETUP We introduce some additional setup that is common to many lemmas in this section. We will generally consider an L-layer Transformer, as defined in (3), for some arbitrary L Z+. As in (3), we use Zl to denote the input to layer l. The input Z0 will encode information about a graph G, along with a number of demand vectors ψ1 . . . ψk Rn. We use Ψ to denote the n k matrix whose ith column is ψi. Unless otherwise stated, Zl R(d+2k) n, where n is the number of nodes, d is the number of edges, and k is the number of demands/queries. Z 0 := [B, Ψ, 0n k]. On parameter size: In a straightforward implementation, the above Transformer has feature dimension h = (d + 2k). The size of W Q, W K, W V , W R is O(h2) = O(d2 + k2), which is prohibitively large as d can itself be O(n2). The size of parameter matrices can be significantly reduced to O(k2) by imposing certain constraints on the parameter matrices; we present this reduction in (7) in Section 5. For simplicity of exposition, lemmas in this section will use the naive implementation in (3). We verify later that all the constructions presented in this section can also be realized in (7). 3.2 SOLVING ELECTRIC FLOW WITH GRADIENT DESCENT Assume we are given a graph G = (V, E), along with a non-negative vector of resistances r Rd +. Let R be the d d diagonal matrix with r on its diagonal. A flow is represented by f Rd, where fj denotes the (directed) flow on edge ej. The energy of an electric flow is given by Pd j=1 rjf 2 j . Let ψ Rn denote a vector of demands. Throughout this paper, we will assume that the demand vectors satisfy flow conservation, i.e., ψ, 1 = 0. The ψ-electric flow is the unique minimizer of the following (primal) flow-optimization problem (by convex duality, this is equivalent to a dual potential-optimization problem): (primal) min f Rd j=1 rjf 2 j subject to the constraint BR1/2f = ψ. (4) (dual) min ϕ Rn ϕ Lϕ 2ϕ ψ. (5) The argument is standard; for completeness, we provide a proof of equivalence between (4) and (5) in (8) in Appendix 10.1. It follows that the optimizer ϕ of (5) has closed-form solution ϕ = L ψ. In Lemma 1 below, we show a simple construction that enables the Transformer in (3) to compute in parallel, the optimal potential assignments for a set of k demands {ψi}i=1...k, where ψi Rn. Motivation: Effective Resistance Metric An important practical motivation for solving the electric flow (or equivalently computing L ) is to obtain the Effective Resistance matrix R Rn n. GNNs that use positional encodings derived from R have demonstrated state-of-the-art performance on numerous tasks, and can be shown to have good theoretical expressivity (Zhang et al., 2023; Velingker et al., 2024; Black et al., 2024). Formally, R is defined as Rij := (ui uj) L (ui uj), where ui denotes the vector that has a 1 in the ith coordinate, and 0s everywhere else. Intuitively, Rij is the potential drop required to send 1-unit of electric flow from node i to node j. Let ℓ Rn denote the vector of diagonals of L (i.e. ℓi := Lii). Then R = 1ℓ + ℓ 1 2L ; thus computing L essentially also computes R. Lemma 1 (Transformer solves Electric Flow by implementing Gradient Descent) Consider the setup in Section 3.1. Assume that ψi, 1 = 0 for each i = 1...k. For any δ > 0 and for any Published as a conference paper at ICLR 2025 L-layer Transformer, there exists a choice of weights W V , W Q, W K, W R, such that each layer of the Transformer (3) implements a step of gradient descent with respect to the dual electric flow objective in (5), with stepsize δ. Consequently, the following holds for all i = 1...k and for any graph Laplacian with maximum eigenvalue λmax 1/δ and minimum nontrivial eigenvalue λmin: [ZL] d+k+i L ψi 2 exp ( δLλmin/2) λmin ψi 2. Discussion. In the special case when k = n and Ψ := In n 1 n 1 1 , [ZL]d+k+1...d+2k L . An ε-approximate solution requires L = O(log(1/ε)) layers; this is exactly the convergence rate of gradient descent on a Lipschitz smooth and strongly convex function. We defer details of the proof of Lemma 1 to Appendix 10.1, and we provide the explicit choice of weight matrices in (10). On a high level, we show that a single attention layer can implement exactly a single gradient descent step wrt the dual objective Fψ (5), i.e. [Zl+1] d+k+i := [Zl] d+k+i δ Fψi([Zl] d+k+i). The log(1/ε) rate then follows immediately from the convergence rate of gradient descent, combined with smoothness and (restricted) strong convex of Fψ. In Lemma 4 below, we show an alternate construction that reduces this to O(log log(1/ε)). We highlight that the constructed weight matrices are very sparse; each of W V , W Q, W K, W R contains a single identity matrix in a sub-block. This sparse structure makes it possible to drastically reduce the number of parameters needed, which we exploit in Section 5 to design a more parameter efficient Transformer. We provide experimental validation of Lemma 1 in Section 3.5. 3.3 IMPLEMENTING TRUNCATED POWER SERIES In this section, we present two more constructions: one for computing L (Lemma 2), and one for computing the heat kernel e s L (Lemma 3). Both quantities have been successfully used for positional encoding in various GNNs. The constructions proposed in these two lemmas are also similar, and involve implementing the power series of the respective targets. 3.3.1 COMPUTING THE PRINCIPAL SQUARE ROOT Motivation : Resistive Embedding The following fact relates the effective resistance matrix R to any square root of L : Fact 1 Let M denote any matrix that satisfies MM = L . Let R Rn n + denote the matrix of effective resistances (see Section 3.2). Then Rij = Mi Mj 2 2, where Mi is the ith row of M. One can verify the above by noticing that Mi Mj 2 2 = M i Mi + M j Mj 2M i Mj = [L]ii + [L]jj 2Lij. In some settings, it is more natural to use the rows of M to embed node position, instead of directly using R: By assigning an embedding vector wi := Mi to vertex i, the Euclidean distance between wi and wj equals the resistance distance. The matrix M is underdetermined, and for any m > n, there are infinitely many choices of M that satisfy MM = L . Fact 1 applies to any such M. Velingker et al. (2024) uses the rows of M = L B for resistive embedding. Under this choice, Mi has dimension d = |E|, which can be quite large. To deal with this, they additionally performs a dimension-reduction step using Johnson Lidenstrauss. Among all valid choices of M, there is a unique choice that is symmetric and minimizes M F , namely, US 1/2U , where USU = L is the eigenvector decomposition of L. We reserve to denote this matrix; L is called the principal square root of L . In practice, L might be preferable to, say, L B because it has an embedding dimension of n, as opposed to the possibly much larger d. We present in Lemma 2 a Transformer construction for computing Lemma 2 (Principal Square Root L ) Consider the setup in Section 3.1. Assume that ψ1...ψk Rn satisfy ψi, 1 = 0. For any L-layer Transformer (3), there exists a configuration of weights W V , W Q, W K, W R such that the following holds: For any graph with maximum Laplacian eigenvalue less than λmax and minimum non-trivial Laplacian eigenvalue greater than λmin, and for all i = 1...k: [ZL] d+k+i L ψi 2 e Lλmin/λmax L/λmax ψi 2. Published as a conference paper at ICLR 2025 Discussion. We defer a proof of Lemma 2 to Appendix 10.2. The high-level idea is that each layer of the Transformer implements one additional term of the power series expansion of L . Under the choice k = n and Ψ = In n (1/n) 1 1 , [ZL]d+k+1...d+2k L . An ε approximation requires log(1/ε) layers. We consider the more general setup involving arbitrary ψ is as they are useful for projecting the resistive embedding onto a lower-dimensional space; this is relevant when ψi s are trainable parameters (see e.g., the setup in Appendix 14.3). We empirically validate Lemma 2 in Section 3.5. 3.3.2 COMPUTING THE HEAT KERNEL: exp ( s L) Finally, we present a result on learning heat kernels. The heat kernel has connections to random walks and diffusion maps Coifman & Lafon (2006). It plays a central role in semi-supervised learning on graphs (Xu et al., 2020); it is also used for positional encoding (Choromanski et al., 2022). Note that sometimes, the heat kernel excludes the nullspace of L and is defined as e s L 1 1 /n. Lemma 3 (Heat Kernel e s L) Consider the setup in Section 3.1. Assume that ψ1...ψk Rn satisfy ψi, 1 = 0. Let s > 0 be an arbitrary temperature parameter. There exists a configuration of weights for the L-layer Transformer (3) such that the following holds: for any input graph whose Laplacian L satisfies 8sλmax L, and for all i = 1...k, [Z] L,i e s Lψi 2 2 L+8sλmax+1 ψi 2 Discussion. We defer the proof of Lemma 3 to Appendix 10.3. To obtain ε approximation error, we need number of layers L O(log(1/ε) + sλmax). As with the preceding lemmas, the flexibility of choosing any number of ψi s enables the Transformer to learn a low-dimensional projection of the heat kernel. The dependence on sλmax is fundamental, stemming from the fact that the truncation error of the power series of e s L begins to shrink only after O(sλmax) the first terms. In Lemma 5 in the next section, we weaken this dependence from sλmax to log(sλmax). We empirically validate Lemma 3 in Section 3.5. 3.4 IMPLEMENTING MULTIPLICATIVE POLYNOMIAL EXPANSION We present two alternative Transformer constructions that can achieve vastly higher accuracy than the ones in preceding sections: Lemma 4 computes an ε-accurate electric flow in exponentially fewer layers than Lemma 1. Lemma 5 approximates the heat kernel with higher accuracy than Lemma 3 when the number of layers is small. The key idea in both Lemmas 4 and 5 is to use the Transformer to implement a multiplicative polynomial; this in turn makes key use of the self-attention module. The setup for Lemmas 4 and 5 differs in two ways from that presented in Section 3.1. First, the input to layer l, Zl, are now in R3n n, instead of R(d+2k) n. When the graph G is sparse and the number of demands/queries k is small, the constructions in Lemma 1 and 3 may use considerably less memory. This difference is fundamental, due to the storage required for raising matrix powers. Second, the input is also different; in particular, information about the graph is provided via L as part of the input Z0, as opposed to via the incidence matrix B as done in Section 3.1. This difference is not fundamental: one can compute L = B B from B in a single attention layer; in the spirit of brevity, we omit this step. We begin with the faster construction for electric flow: Lemma 4 Let δ > 0. Let Z 0 := (ˆIn n δL), In n, δIn n . Then there exist a choice of W V , W Q, W K, W R for a L-layer Transformer (3) such that for any graph with Laplacian with smallest non-trivial eigenvalue λmin and largest eigenvalue λmax 1/δ, [ZL]2n+1...3n L 2 1 λmin exp δ2Lλmin . Discussion. Lemma 4 shows that the Transformer can compute an ε-approximation to L (which is sufficient but not necessary for solving arbitrary electric flow demands) using log log(1/ε) layers. This is much fewer than the log(1/ε) layers required in Lemma 1. The key idea in the proof is to implement a multiplicative polynomial expansion for L . We defer the proof to Appendix 10.4. Next, we show the alternate construction for computing the heat kernel: Published as a conference paper at ICLR 2025 Lemma 5 (Fast Heat Kernel) Let s > 0. Let L be the number of Transformer layers. Let Z 0 := (In n 3 Ls L). Then there exist a choice of W V , W Q, W K, W R such that for any graph whose Laplacian L satisfies sλmax 3L, ZL exp( s L) 2 3 L+1s2λ2 max. Discussion. Lemma 5 shows that the Transformer can compute an ε-approximation to e s L using O(log(1/ε)+log(sλmax)) layers. The ε dependence is the same as Lemma 3, but the log(sλmax) is an improvement. When the number of layers is small, Lemma 5 gives a significantly more accurate approximation. The proof is based on the well-known approximation ex (1 + x/L)L. We defer proof details for Lemma 5 to Appendix 10.4. 3.5 EXPERIMENTS In Figure 1, we experimentally verify that the Transformer is capable of learning to solve the three objectives presented in Lemmas 1, 2 and 3. The setup is as described in Section 3.1, with the Transformer described in (3) with k = n, but with the following important difference: the output of each layer is additionally normalized as follows: [Zl]1...n [Zl]1...n/ [Zl]1...n F , [Zl]n+1...n+2k [Zl]n+1...n+2k/ [Zl]n+1...n+2k F , where F is the Frobenius norm. Without this normalization, training the Transformer becomes very difficult beyond 5 layers. We consider two kinds of random graphs: fully-connected graphs (n = 10, d = 45) and Circular Skip Links (CSL) graphs (n = 10, d = 20). Edge resistances are randomly sampled. We provide details on the sampling distributions in Appendix 13. For each input graph G, we sample n demands ψ1...ψn Rn independently from the unit sphere. Let Ψ = [ψ1...ψn]. The input to the Transformer is B , Ψ , 0n n , consistent with the setup described in Section 3.1. The training/test loss is given by loss U := E [Zl] d+n+i [Zl]d+n+i 2 Uψi Uψi 2 , where U n L , L , e 0.5Lo . We learn the correct solutions only up to scaling, because we need to normalize Zl per-layer. The expectation is over randomness in sampling L and Ψ. We plot the log of the respective losses against number of layers after training has converged. As can be seen, in each plot, and for both types of architectures, the loss appears to decrease exponentially with the number of layers. 1 3 5 7 9 # Layer CSL-linear FC-linear 1 3 5 7 9 # Layer CSL-linear FC-linear 1 3 5 7 9 # Layer CSL-linear FC-linear 1(a) loss L (electric) 1(b) loss L (resistive) 1(c) lossexp ( 0.5L) (heat) Figure 1: loss U against number of layers at convergence for U n L , L , e 0.5Lo . 4 TRANSFORMERS CAN IMPLEMENT SUBSPACE ITERATION TO COMPUTE EIGENVECTORS We present a Transformer construction for computing the eigenvectors of the graph Laplacian L. The Laplacian eigenvectors are commonly used for positional encoding (see Section 1.2). In particular, the eigenvector for the smallest non-trivial eigenvalue has been applied with great success in graph segmentation (Shi & Malik, 1997) and clustering (Bühler & Hein, 2009). Our construction is based on the subspace iteration algorithm (Algorithm 1), aka block power method see (Bentbib & Kanber, 2015). The output Φ of Algorithm 1 converges to the top-k eigenvectors of L. In Corollary 7, we present a modified construction that instead computes the bottom-k eigenvectors of L. For the purposes of this section, we consider a variant of the Transformer defined in (3). ˆZl+1 := Zl + Attn W V l ,W Q l ,W K l (Zl) + W RZl Zl+1 = normalize( ˆZl+1), (6) Published as a conference paper at ICLR 2025 ALGORITHM 1 Subspace Iteration Φ0 Rn k has full column rank . while not converged do ˆΦi+1 LΦi Φi+1 QR(ˆΦi+1) QR(Φ) returns the Q matrix in the QR decomposition of Φ. end while where normalize(Z) applies row-wise normalization: [normalize(Z)]i [Z]i/ [Z]i 2 for i = d+1...d+k. We use Bl Rn d and Φl Rn k to denote refer to specific columns of Z l , defined as Z l =: BT l , Φ l . The notation is chosen as Φl in Zl in (6) corresponds to Φl in Algorithm 1. We initialize B0 = B and let Φ0 be some arbitrary matrix with full column rank. Φ1 becomes orthogonal after the first QR step of Algorithm 1. Lemma 6 (Subspace Iteration for Finding Top k Eigenvectors) Consider the Transformer defined in (6). There exists a choice of W V , W Q, W K, W R such that k + 1 layers of the Transformer implements one iteration of Algorithm 1. Consequently, the output ΦL of a L-layer Transformer approximates the top-k eigenvectors of L to the same accuracy as L/(k + 1) steps of Algorithm 1. Discussion. We bound the Transformer s error by the error of Algorithm 1, but do not provide an explicit convergence rate. This omission is because the convergence rate of Algorithm 1 itself is difficult to characterize, and has a complicated dependence on pairwise spectral gaps of Φ0. The high-level proof idea is to use self-attention to orthogonalize the columns of Φl (Lemma 8. We defer the proof of Lemma 6 to Appendix 11. We experimentally validate Lemma 6 in Section 4.1. The construction in Lemma 6 explicitly requires k layers to perform a QR decomposition of Φ plus 1 layer to multiply by L. The layer usage can be much more efficient in practice: First, multiple L-multiplications can take place before a single QR-factorization step. Second, the k layers for performing QR decomposition can be implemented in a single layer with k parallel heads. In graph applications, one is often interested in the bottom-k eigenvectors of L. The following corollary shows that a minor modification of Lemma 6 will instead compute the bottom k eigenvectors. Corollary 7 (Subspace Iteration for Finding Bottom k Eigenvectors) Consider the same setup as Lemma 6. Let µ > 0 be some constant. There exists a construction for a L-layer Transformer (similar to Lemma 6), which implements Algorithm 1 with L replaced by µI L, and ΦL approximates the bottom k eigenvectors of L if λmax(L) µ. We provide a short proof in Appendix 11. The key idea is that a minor modification of the construction in Lemma 6 will instead compute the bottom k eigenvectors. Alternative to the construction in Corollary 7, one can also first compute L (via Lemma 4), followed by subspace iteration for L . 4.1 EXPERIMENTS FOR LEMMA 6 We verify Lemma 6 and Corollary 7 experimentally by evaluating the ability of the Transformer (3) to learn top-k and bottom-k eigenvectors. As in Section 3.5, we consider two kinds of random graphs with n = 10 nodes: fully connected (d = 45 edges) and CSL (d = 20 edges); each edge is has a randomly sampled resistance; see Appendix 13 for details. For a graph G with Laplacian L, let λ1 λ2 ...λ10 denote its eigenvalues. Let v1, . . . , v10 denote its eigenvectors. λ1 is always 0 and v1 is always 1/ n. The Transformer architecture is as defined in Section 6, with k = n. We increase the dimension of Φl to (2n) n, and thus the dimension of Zl to (d + n + n) n. We read out the last n rows of ZL as output. The purpose of increasing the dimension is to make the architecture identical to the one used in the experiments in Section 3.5; the construction in Lemmas 6 and Corollary 7 extend to this setting by setting appropriate parameters to 0. In addition, we also normalize [Zl]1...d each layer by its Frobenius norm, i.e., [Zl]1...d [Zl]1...d/ [Zl]1...d F (the proof of Lemma 6 still holds as subspace iteration is scaling-invariant). The input to the Transformer is Z 0 = [B, Φ0], and we make Φ0 R2n n a trainable parameter along with W V , W Q, W K, W R. We define lossi := E min ϕL,i vi 2 2, ϕL,i + vi 2 2 , where ϕL,i is the ith row of ΦL and expectation is taken with Published as a conference paper at ICLR 2025 respect to randomness in sampling the graph; the min is required because eigenvectors are invariant to sign-flips. We train and evaluate the Transformer on two kinds of losses: loss1 5 := 1 5 P5 i=1 lossi and loss1 10 := 1 10 P10 i=1 lossi. We plot the results in Figure 2, and summarize our findings below: 1. 2(a) and 2(d): Both loss1 5 and loss1 10 appear to decrease exponentially with the number of Transformer layers, for both FC and CSL graphs. This is consistent with Lemma 6 and Corollary 7. 2. In Figures 2(e) and 2(f), when the Transformer is trained on loss1 10, larger eigenvectors are learned more accurately, with the exception of loss2. This is mostly consistent with our construction in Lemma 6, where the larger eigenvectors are computed more quickly. In contrast, in Figures 2(b) and 2(c), when the Transformer is trained on loss1 5, the smaller eigenvectors are learned more accurately. This is consistent with our construction in Corollary 7, where the smaller eigenvectors of L are also the larger eigenvectors of αI L, and are thus computed more quickly. One expects to observe the same ordering if the Transformer is instead computing L first, followed by subspace iteration. We omit loss1 from Figures 2(b), 2(c), 2(e), 2(f) because v1 is a constant so loss1 goes to 0 extremely fast, making it hard to see the other lines. 1 3 5 7 9 # Layers fully connected CSL 1 3 5 7 9 # Layer 1 3 5 7 9 # Layer 2(a) loss1 5 2(b) {lossi}i=2...5, FC 2(c) {lossi}i=2...5, CSL 1 3 5 7 9 # Layers fully connected CSL 1 3 5 7 9 # Layer 2 3 4 5 6 7 8 9 10 1 3 5 7 9 # Layer 2 3 4 5 6 7 8 9 10 2(d) loss1 10 2(e) {lossi}i=2...10, FC 2(f) {lossi}i=2...10, CSL Figure 2: log(loss ) vs. number of layers. Top row: various losses for the Transformer trained on loss1 5. Bottom row: various losses for the Transformer trained on loss1 10. 5 PARAMETER-EFFICIENT IMPLEMENTATION As explained in Section 3.1, the sizes of the W Q, W K, W V , W R matrices can scale with O(n4) in the worst case, where n is the number of nodes. This makes the memory requirement prohibitive. To alleviate this, we present below in (7) a more efficient Transformer implementation than the one defined in (3). The representation in (7) is strictly more constrained than (3), but it is still expressive enough to implement all the previous constructions. For each layer l, let αV l , αQ l , αK l , αR l be scalar weight parameters, and let W V,Φ l , W Q,Φ l , W K,Φ l , W R,Φ l R2k 2k. Let Bl, Φl evolve as B l+1 = (1 + αR l )B l + αV l B l αQ l αK l Bl B l + Φl W Q,Φ l W K,Φ l Φ l (7) Φ l+1 = I + W R,Φ l Φ l + W V,Φ l Φ l αQ l αK l Bl B l + Φl W Q,Φ l W K,Φ l Φ l , with initialization B0 := B, and Φ0 = [Ψ, 0k n]. We verify that (7) matches (3), with Zl = B l , Φ l ; the difference is that certain blocks of W V l , W Q l , W K l , W R l constrained to be zero or Published as a conference paper at ICLR 2025 some scaling of identity. Nonetheless, we verify that the constructions in Lemmas 1, 2, 3 and 6 can all be realized within the constrained dynamics (7). We prove this in Lemma 9 in Appendix 12. In Figure 15, we show that the efficient implementation (7) performs similarly (and in many cases better than) the standard implementation (3) on all the synthetic experiments. Besides the reduced parameter size, we highlight two additional advantages of (7): 1. Let TFB l (B, Φ) (resp TFΦ l (B, Φ)) be defined as the value of B l (resp Φ l ) when initialized at B0 = B and Φ0 = Φ under the dynamics (7). Let U Rd d be some permutation matrix over edge indices. Then (a) TFB l is equivariant to edge permutation, i.e. TFB l (BU, Φ) = U TFB l (B, Φ) (b) TFΦ l is invariant to edge permutation, i.e. TFΦ l (BU, Φ) = TFΦ l (B, Φ). We provide a short proof of this in Lemma 10 in Appendix 12. 2. If Bl is sparse, then B l Bl can be efficiently computed using sparse matrix multiply. This is the case for all of our constructions for all layers l. 6 LEARNING POSITIONAL ENCODING FOR MOLECULAR REGRESSION In Sections 3.5 and 4.1, we saw that the Transformer is capable of learning a variety of embeddings based on L , L, e s L, and EVD(L), when the training loss is explicitly the recovery loss for that embedding. An natural question is then: Can a GNN perform as well on a downstream task if we replace its PE by a linear Transformer, and train the GNN together with the linear Transformer? There are two potential benefits to this approach: First, the Transformer can learn a PE that is better-suited to the task than hard-coded PEs, and thus achieve higher accuracy. Second, the Transformer with a few layers/dimensions may learn to compute only the most relevant features, and achieve comparable accuracy to the hard-coded PE using less computation. To test this, we evaluate the performance of our proposed linear Transformer on a molecular regression task on two real-world datasets: QM9 (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014) and ZINC (Irwin et al., 2012). The regression target is constrained solubility (Dwivedi et al., 2023). Our experiments are based on the Graph Transformer (GT) implementation from Dwivedi & Bresson (2020). In Table 2, we compare three loss values: GT without PE, GT with Laplacian Eigenvector as PE (Lap PE), and GT with Linear Transformer output as PE. We use Lap PE as a baseline because it was the PE used in Dwivedi & Bresson (2020). We present details of the datasets in Appendix 14.1. The linear Transformer architecture is a modified version of (7), detailed in Appendix 14.2. Other experiment details, including precise model definitions, can be found in Appendix 14.3. Model # Parameters Loss Graph Transformer 800771 891 0.286 0.0078 Graph Transformer + Lap PE 800771 0.201 0.0034 Graph Transformer + Linear Transformer 800771 + 488 0.138 0.012 Graph Transformer 799747 512 0.419 0.0047 Graph Transformer + Lap PE 799747 0.227 0.0094 Graph Transformer + Linear Transformer 799747 + 240 0.221 0.0060 Table 2: Regression Loss for ZINC and QM9 for different choices of PE. The difference between GT and GT + Lap PE is substantial for both QM9 and ZINC, highlighting the importance of PE in both cases. Going from GT + Lap PE to GT + Linear Transformer for ZINC, there is a significant further improvement in loss (about 30%). This is remarkable, considering that the linear Transformer accounts for less than 0.7% of the total number of parameters. Note that the SOTA error for ZINC regression is significantly lower than 0.138 on more recent architectures; the significance of our result is in demonstrating the improvement just by replacing the Lap PE with the linear Transformer, while keeping everything else fixed. In contrast, going from GT + Lap PE to GT + Linear Transformer for QM9, the difference is essentially zero. We conjecture that this may be because QM9 molecules (average of about 9 nodes and 19 edges) are considerably smaller than ZINC (average of about 23 nodes and 40 edges). Thus there may not be too many additional useful features to learn, and Lap PE close to optimal. It is still consistent with our theory, for the Linear Transformer to do as well as Lap PE. Published as a conference paper at ICLR 2025 7 ETHICS STATEMENT The authors have adhered to ICLR s code of ethics during the writing of this paper, and when conducting the research described herein. There are no ethics concerns which need to be highlighted. 8 REPRODUCIBILITY We use both synthetic data and open-source molecular datasets in our experiments. Details of our experiment procedures and implementations have been provided in Section 3.5, Section 4.1, Section 6 and Appendix 14. 9 ACKNOWLEDGEMENTS Suvrit Sra Acknowledges generous support from the Alexander von Humboldt Foundation s AI Professorship. The authors thank Aaron Schild for the many valuable discussions, and Professor Xavier Bresson from the National University of Singapore for sharing the code which facilitated some of the experiments in this work. Published as a conference paper at ICLR 2025 AH Bentbib and A Kanber. Block power method for svd decomposition. Analele stiin tifice ale Universit a tii" Ovidius" Constan ta. Seria Matematic a, 23(2):45 58, 2015. Mitchell Black, Zhengchao Wan, Gal Mishne, Amir Nayyeri, and Yusu Wang. Comparing graph transformers via positional encodings. ar Xiv preprint ar Xiv:2402.14202, 2024. Thomas Bühler and Matthias Hein. Spectral clustering based on the graph p-laplacian. In Proceedings of the 26th annual international conference on machine learning, pp. 81 88, 2009. François Charton. Linear algebra with transformers. ar Xiv preprint ar Xiv:2112.01898, 2021. Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamas Sarlos, Adrian Weller, and Thomas Weingarten. From blocktoeplitz matrices to differential equations on graphs: towards a general theory for scalable masked transformers. In International Conference on Machine Learning, pp. 3962 3983. PMLR, 2022. Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5 30, 2006. Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. ar Xiv preprint ar Xiv:2012.09699, 2020. Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Graph neural networks with learnable structural and positional representations. ar Xiv preprint ar Xiv:2110.07875, 2021. Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1 48, 2023. Moshe Eliasof, Fabrizio Frasca, Beatrice Bevilacqua, Eran Treister, Gal Chechik, and Haggai Maron. Graph positional encoding via random feature propagation. In International Conference on Machine Learning, pp. 9202 9223. PMLR, 2023. John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52 (7):1757 1768, 2012. Jinwoo Kim, Saeyoon Oh, and Seunghoon Hong. Transformers generalize deepsets and can be extended to graphs & hypergraphs. Advances in Neural Information Processing Systems, 34: 28016 28028, 2021. Jinwoo Kim, Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, and Seunghoon Hong. Pure transformers are powerful graph learners. Advances in Neural Information Processing Systems, 35:14582 14595, 2022. Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618 21629, 2021. Derek Lim, Joshua Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, and Stefanie Jegelka. Sign and basis invariant networks for spectral graph representation learning. ar Xiv preprint ar Xiv:2202.13013, 2022. Liheng Ma, Chen Lin, Derek Lim, Adriana Romero-Soriano, Puneet K Dokania, Mark Coates, Philip Torr, and Ser-Nam Lim. Graph inductive biases in transformers without message passing. In International Conference on Machine Learning, pp. 23321 23337. PMLR, 2023. Grégoire Mialon, Dexiong Chen, Margot Selosse, and Julien Mairal. Graphit: Encoding graph structure in transformers. ar Xiv preprint ar Xiv:2106.05667, 2021. Luis Müller, Mikhail Galkin, Christopher Morris, and Ladislav Rampášek. Attending to graph transformers. ar Xiv preprint ar Xiv:2302.04181, 2023. Published as a conference paper at ICLR 2025 Wonpyo Park, Woonggi Chang, Donggeon Lee, Juntae Kim, and Seung-won Hwang. Grpe: Relative positional encoding for graph transformer. ar Xiv preprint ar Xiv:2201.12787, 2022. Richard Peng and Daniel A Spielman. An efficient parallel solver for sdd linear systems. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pp. 333 342, 2014. Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific data, 1(1):1 7, 2014. Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501 14515, 2022. Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in neural information processing systems, 33:12559 12571, 2020. Lars Ruddigkeit, Ruud Van Deursen, Lorenz C Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17. Journal of chemical information and modeling, 52(11):2864 2875, 2012. Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pp. 9355 9366. PMLR, 2021. Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. In Proceedings of IEEE computer society conference on computer vision and pattern recognition, pp. 731 737. IEEE, 1997. Balasubramaniam Srinivasan and Bruno Ribeiro. On the equivalence between positional node embeddings and structural graph representations. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJxz Fy SKw H. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998 6008, 2017. URL http://arxiv.org/abs/1706.03762. Ameya Velingker, Ali Sinop, Ira Ktena, Petar Veliˇckovi c, and Sreenivas Gollapudi. Affinity-aware graph networks. Advances in Neural Information Processing Systems, 36, 2024. Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pp. 35151 35174. PMLR, 2023. Bingbing Xu, Huawei Shen, Qi Cao, Keting Cen, and Xueqi Cheng. Graph convolutional networks using heat kernel for semi-supervised learning. ar Xiv preprint ar Xiv:2007.16002, 2020. Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? Advances in neural information processing systems, 34:28877 28888, 2021. Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. Graph transformer networks. Advances in neural information processing systems, 32, 2019. Bohang Zhang, Shengjie Luo, Liwei Wang, and Di He. Rethinking the expressive power of gnns via graph biconnectivity. ar Xiv preprint ar Xiv:2301.09505, 2023. Published as a conference paper at ICLR 2025 10 PROOFS FOR SECTION 3 10.1 ELECTRIC FLOW Proof of Primal Dual Equivalence for Electric Flow By convex duality, the primal problem in (4) can be reformulated as min f:BR1/2f=ψ j=1 rjf 2 j = min f max ϕ f Rf 2ϕ (BR1/2f ψ) = max ϕ min f f Rf 2ϕ (BR1/2f ψ) = max ϕ ϕ BB ϕ + 2ϕ ψ = min ϕ Rn ϕ Lϕ 2ϕ ψ (8) Proof of Lemma 1 Let Fψ(ϕ) := 1 2ϕ Lϕ ϕ ψ denote (1/2 times) the dual optimization objective from (5). Its gradient is given by Fψ(ϕ) = Lϕ ψ. Let Zl denote the output after l layers of the Transformer (3), i.e. Zl evolves according to (3): Zl+1 := Zl + W V l Zl Z l W Q l W K l Zl + W R l Zl. We use B l to denote the first d rows of Zl, Λ l to denote the (d + 1)th row to (d + k)th row of Zl and Φ l to denote the last k rows of Zl, i.e. Zl = B l Λ l Φ l . For i = 1...k, let ϕl,i denote the ith column of Φl. Then there exists a choice of W V , W Q, W K, W R, such that for all i, ϕl+1,i = ϕl,i δ Fψi(ϕl,i). (9) Before proving (9), we first consider its consequences. We can verify that 2Fψ( ) = L, which is in turn upper and lower bounded by λmin L λmax. By standard analysis of gradient descent for strongly convex and Lipschitz smooth functions, for δ < 1 λmax , we verify that for all i = 1...k, ϕL,i L ψi 2 2 = ϕL,i ϕ i 2 2 (1 δλmin) ϕL 1,i ϕ i 2 2 e δLλmin ϕ0,i ϕ i 2 2 =e δLλmin ϕ i 2 2 exp ( δLλmin) where ϕ i := arg minϕ Fψi(ϕ) = L ψi. As a final note, F is only weakly convex along the 1 direction, but we can ignore this because ϕl,i will always be orthogonal to 1 as long as ϕ0,i is (as we assumed in the lemma statement). The remainder of the proof will be devoted to showing (9). Recall that the input to the Transformer Z0 is initialized as Z0 = R(d+2k) n. For some fixed δ, our choice of parameter matrices Published as a conference paper at ICLR 2025 will be the same across layers: "0d d 0d k 0d k 0k d 0k k 0k k 0k d 0k k δIk k , W Q l W K l = "Id d 0d k 0d k 0k d 0k k 0k k 0k d 0k k 0k k "0d d 0d k 0d k 0k d 0k k 0k k 0k d δIk k 0k k By the choice of W V , W R in (10), we verify that for all layers l, Bl = B0 = B, so that Z l (W Q l ) W K l Zl = L. Again by choice of W V , W R in (10), Λl = Ψ for all l. It thus follows from induction that Φl+1 = Φl δLΦl + δΨ (11) ϕl+1,i = ϕl,i δLϕl,i + δψi = ϕl,i δ Fψi(ϕl,i) for i = 1...k, thus proving (9). 10.2 RESISTIVE EMBEDDING Proof of Lemma 2 Let ˆI := In n 1 n 1 1 . Let δ := 1/λmax. Consider the matrix power series, which converges under our choice of δ: r ˆI (ˆI δL) = where the second inequality uses the fact that (I δL)ˆI = ˆI δL ˆI. We define αl := "0d d 0d k 0d k 0k d δIk k 0k k 0k d 0k k 0k k , W Q l W K l = "Id d 0d k 0d k 0k d 0k k 0k k 0k d 0k k 0k k "0d d 0d k 0d k 0k d 0k k 0k k 0k d αl Ik k 0k k Define Zl =: . We verify that for all l, Bl = B, and Λl+1 = (I δL)Λl = (I δL)l Φl+1 = Φl + αlΛl = i=0 αi(I δL)i ˆIΨ, where we use the fact that Ψ = ˆIΨ by assumption. Therefore, [Z] L,i i=L αi(ˆI δL)i 2 ψi 2 By upper and lower bounds of Stirling s formula 2πLLLe L L! e LLLe L, we can bound αL p δ/L. Using the fact that (ˆI δL) (I δλmin)ˆI e δλmin ˆI, and using the bound on geometric sum, the above is bounded by e Lδλmin δL ψi 2. The conclusion follows by plugging in δ = 1/λmax. 10.3 HEAT KERNEL Proof of Lemma 3 The power series for e s L is given by Published as a conference paper at ICLR 2025 We define αl := ( s)l l! . For each layer l, define "0d d 0d k 0d k 0k d Ik k 0k k 0k d 0k k 0k k , W Q l W K l = "Id d 0d k 0d k 0k d 0k k 0k k 0k d 0k k 0k k "0d d 0d k 0d k 0k d Ik k 0k k 0k d αl Ik k 0k k Define Zl =: . We verify that for all l, Bl = B, and Λl+1 = LΛl = Ll+1Ψ Φl+1 = Φl + αlΛl = i=0 αi LiΨ. Therefore, [Z] L,i e s Lψi 2 i=L+1 αi Li 2 ψi 2 By lower bound of Stirling s formula, L! (L/e)L. Therefore, for L 8sλmax, we have αLLL 2 2 L+8sλmax. For L 2, the infinite sum is within a factor 2 of the first term, thus P i=L+1 αi Li 2 2 L+8sλmax+1. 10.4 FASTER CONSTRUCTIONS Proof of Lemma 4 We begin by recalling the Taylor expansion of L : (recall that ˆI := In n 1 1 /n). Our proof of this section uses a simple but powerful alternative expansion (see for example Peng & Spielman (2014)): for any integer t, I + ˆI δL 2l = δ l=0 (ˆI δL)l. (13) Notice that t terms in the LHS equals 2t terms of the RHS. This is exactly why we will get a doubly exponential convergence. As seen in the proof of Lemma 1, each term in the RHS of (13) coincides with one step of gradient descent. Therefore, if one layer of the Transformer can implement one additional product on the LHS, a L layer Transformer can efficiently implement 2L steps of gradient descent. In the remainder of the proof, we will show exactly this. For all layers l, let "In n 0n n 0n n 0n n 0n n 0n n In n 0n n 0n n , W Q l W K l = "0n n 0n n 0n n In n 0n n 0n n 0n n 0n n 0n n " In n 0n n 0n n 0n n 0n n 0n n 0n n 0n n 0n n Let Z l =: [Γl, Λl, Φl], where Γl, Λl, Φl Rn n. Under the configuration of weight matrices above, we verify that Λl = In n for all l, and thus Z l W Q l W K l Zl = ΛlΓ l = Γl. for all l. Next, we verify by induction that Γl+1 = Γl Γl + Γ l Γl = Γ l Γl = ˆI δL 2l . Note that Γl is symmetric. Finally, we verify that Φl+1 = Φl + Γ l Φl = I + ˆI δL 2l Φl. Thus by induction, I + (ˆI δL)2i . Published as a conference paper at ICLR 2025 Finally, again using (13), the residual term is given by Noting that (ˆI δL) (1 δλmin L)ˆI, we can bound L ΦL 2 exp 2Lδλmin Proof of Lemma 5 Let C := 3L. We will use the bound (I s L/C)C e s L (I s L/C)C I s2L2/C 1 (I s L/C)C I + 2s2L2/C , where the second inequality holds by our assumption that sλmax < 1/C. Therefore, (I s L/C)C exp( s L) 2 2s2L2 C exp( s L) 2 2s2λ2 max C . We will now show that ZL = (I s L/C)C. Let us define, for all l, W V l = In n, W Q l W K l = In n, W R l = In n. Then we verify that Zl+1 = Zl Zl + Zl Z l Zl = Z3 l (by symmetry of Zl). Thus by induction, Zl = Z3l 0 = (I s L/C)3l = (I s L/C)C. 11 PROOFS FOR SECTION 4 Lemma 8 (Single Index Orthogonalization) Consider the same setup as Lemma 6, with Transformer defined in (6). Let ϕl,j Rn denote the jth column of Φl. For any i = 1...k, there exists a choice of W V , W Q, W K, W R, such that ˆϕl+1,i = ϕl,i j=i+1 ϕl,i, ϕl,j ϕl,j, ϕl+1,i = ˆϕl+1,i ˆϕl+1,i 2 and for any j = i, ϕl+1,j = ϕl,j. Proof of Lemma 8 Recall that Z l =: [Bl, Φl]. Let A Rk k denote the matrix where Aii = 1 and is 0 everywhere else. Let H denote the matrix where Hjj = 1 for all j > i, and is 0 everywhere else. Let the weight matrices be defined as W V l = 0d d 0d k 0k d A , W Q l W K l = 0d d 0d k 0k d H Under the above definition, B l = B is a constant across all layers l. We verify that Φl evolves as Φ l+1 = Φ l AΦ l Φl HΦ l . We verify that the effect of left-multiplication by A "selects the ith row of Φ l Φl" and the effect of right-multiplication by H "selects the (i + 1)th...kth columns of Φ l Φl". Thus 0 ... 0 ... 0 ... ... ... 0 ... ϕl,i, ϕl,j+1 ... ϕl,i, ϕl,k ... ... ... 0 ... 0 ... 0 The conclusion can be verified by right-multiplying the above matrix with Φ l , and then applying columnn-wise normalization to Φl (or equivalently, applying row-wise normalization to [Z]d...d+k), as described in (6). Published as a conference paper at ICLR 2025 Proof of Lemma 6 Recall that Zl =: BT l Φ l . Assume for now that Bl = B for all layers l. Let ϕl,i denote the ith column of Φl. Let l be some fixed layer with weights W V l = 0d d 0d k 0k d Ik k , W Q l W K l = Id d 0d k 0k d 0k k , W R l = Id d 0d k 0k d Ik k Combined with (6), these imply that ˆΦl+1 = LΦl, and that ϕl+1,i = ϕl+1,i/ ϕl+1,i 2. This implements the first step in the loop of Algorithm 1. Although Algorithm 1 does not contain the normalization step, the result is identical, because QR(Φ) is invariant to column-scaling of Φ. We now show a (different) configuration of weight matrices which enable a sequence of layers to perform QR decomposition: Consider the construction in Lemma 8. For each i, we can use a single layer to make ϕl+1,i orthogonal with to ϕl,j for all j > i. By putting k such layers in sequence, we can ensure that ϕl+k,i is orthogonal to ϕl+k,j for all i = 1...k and for all j = i + 1...k. Thus Φl+k is exactly the column-orthogonal matrix in a QR decomposition of Φl. Finally, we observe that Bl is unchanged in both (14) and in the construction of Lemma 8. Thus we verify the assumption at the beginning of the proof that Bl = B for all l. This concludes the proof. Proof of Corollary 7 The construction is almost identical to that in the proof of Lemma 6 above. The only difference is that we replace the weight configuration in (14) by W V l = 0d d 0d k 0k d Ik k , W Q l W K l = Id d 0d k 0k d 0k k , W R l = Id d 0d k 0k d (µ 1)Ik k where the change is highlighted in red. Under this, we verify that ˆΦl+1 = (µI L)Φl. The remainder of the proof, including construction for the QR(Φ) operation, are unchanged from Lemma 6. 12 LEMMAS AND PROOFS FOR SECTION 5 Lemma 9 (Constructions under (7)) The constructions in Lemmas 1, 2, 3 and 6 can be realized within the constrained dynamics (7). Proof of Lemma 9 In general, we verify that (7) is equivalent to (3) with weights satisfying the following form: W V l = αV l Id d 0d 2k 02k d W V,Φ l αQ l Id d 0d 2k 02k d W Q,Φ l , W K l = αK l Id d 0d 2k 02k d W K,Φ l W R l = αR l Id d 0d 2k 02k d W R,Φ l Below, we state the weight configurations for (7) which will recover the constructions in each of the stated lemmas. Note that there is a small change in notation: Φl as defined in Section 5 corresponds to [Λl; Φl] from the proofs of Lemmas 1, 2 and 3. We leave the simple verification of this equivalence to the reader. The construction in Lemma 1 is equivalent to (7) with weight configuration αV l = 0, αQ l = αK l = 1, αR l = 0, W V,Φ l = 0k k 0k k δIk k 0k k , W Q,Φ l = W K,Φ l = 0, W R l = 0k k 0k k δIk k 0k k The construction in Lemma 2 is equivalent to (7) with weight configuration αV l = 0, αQ l = αK l = 1, αR l = 0, W V,Φ l = 1 λmax Ik k 0k k 0k k 0k k , W Q,Φ l = W K,Φ l = 0, W R l = 0k k 0k k 1 λmax Ik k 0k k Published as a conference paper at ICLR 2025 The construction in Lemma 3 is equivalent to (7) with weight configuration αV l = 0, αQ l = αK l = 1, αR l = 0, W V,Φ l = Ik k 0k k 0k k 0k k , W Q,Φ l = W K,Φ l = 0, W R l = Ik k 0k k ( s)l l! Ik k 0k k , where s is the temperature parameter. The construction in Lemma 6 is equivalent to (7) with the following weight configurations: 1. To implement the first step inside the loop of Algorithm 1, let αV l = 0, αQ l = αK l = 1, αR l = 0, W V,Φ l = 0k k 0k k 0k k Ik k , W Q,Φ l = W K,Φ l = 0, W R l = 0k k 0k k 0k k Ik k 2. To implement the QR decomposition step inside the loop of Algorithm 1, we will need to an equivalent construction as in Lemma 8. Let αV l = 0, αQ l = αK l = 1, αR l = 0, and let W V,Φ l = 0k k 0k k 0k k A , W Q,Φ l W K,Φ l = 0k k 0k k 0k k H , W R l = 0, where A and H are as defined in the proof of Lemma 8. Lemma 10 (Invariance and Equivariance) Let TFB l and TFΦ l be as defined in Section 5. Let U Rd d be any permutation matrix. Then for all layers l, TFB l (BU, Φ) = U TFB l (B, Φ) TFΦ l (BU, Φ) = TFΦ l (B, Φ) (15) Proof We will prove these two claims by induction simultaneously. Recall from (7) that B l+1 = (1 + αR,l)B l + αV,l B l αQ,lαK,l Bl B l + Φl W Q,Φ l W K,Φ l Φ l Φ l+1 = I + W R,Φ l Φ l + W V,Φ l Φ l αQ,lαK,l Bl B l + Φl W Q,Φ l W K,Φ l Φ l . Recall from the definition that TFB 0 (B, Φ) := B0 := B and TFΦ l (B, Φ) := Φ0 := Φ. Thus (15) holds by definition. Now suppose (15) holds for some l. By (7), TFB l+1(BU, Φ) =(1 + αR,l)TFB l (BU, Φ) + αV,l TFB l (BU, Φ) αQ,lαK,l TFB l (BU, Φ) TFB l (BU, Φ) + αV,l TFB l (BU, Φ) TFΦ l (BU, Φ) W Q,Φ l W K,Φ l TFΦ l (BU, Φ) =(1 + αR,l)U TFB l (B, Φ) + αV,l U TFB l (B, Φ) αQ,lαK,l TFB l (B, Φ) TFB l (B, Φ) + αV,l U TFB l (B, Φ) TFΦ l (B, Φ) W Q,Φ l W K,Φ l TFΦ l (B, Φ) =U TFB l+1(B, Φ), where we use the fact that UU = Id d. By similar steps as above, we also verify that TFΦ l+1(BU, Φ) = TFΦ l+1(B, Φ). This concludes the proof. 13 DETAILS FOR SYNTHETIC EXPERIMENTS We consider two ways of sampling random graphs: 1. Fully Connected (FC) graphs: n = 10 nodes and d = 45 edges. Published as a conference paper at ICLR 2025 2. Circular Skip Links (CSL) graphs: n = 10 nodes and d = 20 edges. The skip-length is sampled uniformly at random from {2, 4, 6, 8}. See (Dwivedi et al., 2023) for a detailed definition of CSL graphs. For both FC and CSL graphs, the resistance of an edge e is r(e) = eu(e), where u(e) is independently sampled uniformly from [ 2, 2]. 14 DETAILS FOR MOLECULAR REGRESSION EXPERIMENT Below, we provide various details for the molecular regression experiment in Section 6. 14.1 DATASET DETAILS For QM9, the training set size is 20,000, the validation set size is 2000, the test set size is 100,000. The training/validation set are subsampled from the full training/validation set. The average number of nodes is 8.79, and the average number of edges is 18.8. For ZINC, the training set size is 20,000, the validation set size is 2000, the test set size is 24,445. The training/validation set are subsampled from the full training/validation set. The average number of nodes is 23.16, and the average number of edges is 39.83. 14.2 ARCHITECTURE FOR MOLECULAR REGRESSION EXPERIMENT The linear Transformer we use for the experiment in Section 6 is described in (16) below: B l+1 = (1 + αR,l)B l + αV,l B l βl,1αQ,lαK,l D 1/2 l Bl B l D 1/2 l + βl,2Φl W Q,Φ l W K,Φ l Φ l Φ l+1 = I + W R,Φ l Φ l + W V,Φ l Φ l βl,3αQ,lαK,l D 1/2 l Bl B l D 1/2 l + βl,4Φl W Q,Φ l W K,Φ l Φ l Bl+1 Bl+1/ Bl+1 F ϕl+1,i ϕl+1,i/ ϕl+1,i 2 (for i=1...k) (16) It is similar to the memory-efficient Transformer architecture (7) from Section 5, with a number of modifications which we explain below. Let the input to the transformer be Z0 Rd+k, where d denotes the number of edges, and k is the dimension of learned features. 1. Scaling by D 1: The GT in Dwivedi & Bresson (2020) uses eigenvectors of the normalized Laplacian L := D 1/2LD 1/2, where D is the diagonal matrix whose ith diagonal entry is given by [D]ii := |Lii| = Pd j=1 |Bij|. To be consistent with this setup, we modify the dy- namics in (7) to add the scaling by D 1/2 l on the part of the self-similarity matrix involving Bl, highlighted in red in (16), where Dl is the diagonal matrix with the ith diagonal entry given by j=1 |Bl|ij. With this additional scaling in the dynamics, the same weight constructions in all the lemmas in this paper will compute the corresponding quantities for the normalized Laplacian L instead. For instance, with the D 1/2 l scaling, the construction in Lemma 1 computes L , and the construction in Lemma 6 computes the top-k eigenvectors of L. This can be verified using the following two facts: First, in all our constructions, Φl interacts with Bl only via the self-similarity matrix Bl B l . Second, D 1/2BB D 1/2 = L. 2. Independently scaled similarity matrix: For each layer l, we introduce additional scalar parameters βl,1, βl,2, βl,3, βl,4. One layer of (16) is more expressive than one layer of (7) but less expressive than two layers of (7) (ignoring the difference due to D 1/2). 3. Diagonal constraints on W: We constrain W V,Φ, W Q,Φ, W K,Φ to be diagonal, and W R,Φ to be a scaling of identity. Published as a conference paper at ICLR 2025 4. Weight sharing: Layers 3l, 3l + 1, 3l + 2 share the same parameters, for all integers l. Phrased another way, each layer is looped 3 times. 5. Per-layer Scaling: after each layer, we scale Bl to have Frobenius norm 1, and we scale each column of Φl to have Euclidean norm 1. Items 2-4 are useful for reducing the parameter count and improving generalization. Item 5 makes training feasible for deeper layers. The construction in Lemma 4 for subspace iteration can still be realised under the changes in items 2-5. Under the change in item 1, the construction in Lemma 4 will find eigenvectors of L instead of L. 14.3 EXPERIMENT DETAILS The GT we use has 128 hidden dimensions, 8 heads, and 4 layers. The position encoding dimension is 3 for QM9, and 6 for ZINC. The linear Transformer we use contain L = 9 layers, with parameters shared every 3 layers, as described in Section 14.2. The dimension of Φl is 8. We use a linear map, with parameters M, to project ΦL down to the PE dimension (PE_dim=3 for QM9 and PE_dim=6 for ZINC). This output is then passed to the GNN in the same way as the Lap PE. To clarify the distinction among the three models, we provide pseudo-code for Graph Transformer, Graph Transformer + Lap PE and Graph Transformer + linear Transformer below: Let GT denote the graph transformer from (Dwivedi & Bresson, 2020). Let G denote the collection of basic graph information, including edge list, edge features, and node features. Let Lap PE(G) return the Laplacian eigenvector positional encoding for G. Let B be the incidence matrix of G. Let LT denote the linear Transformer, which takes B as input (Φ0 can be viewed as internal parameters of LT) Then the predictions for each model in Table 2 are given by Graph Transformer GT(G, None) Graph Transformer + Lap PE GT(G, Lap PE(G)) Graph Transformer + linear Transformer GT(G, linear(M, LT(B))) (17) The trainable parameters are given by {all the parameters of GT} + {all the parameters of LT} ={all the parameters of GT} + {M, Φ0, (α s, β s, W s from (16))}. Before training on the actual regression task, we pretrain the linear Transformer to return the top PE_dim Laplacian eigenvectors. This gives a good initialization, and improves the optimization landscape. We train using Adam W. Graph Transformer parameters are updated with initial 0.001 lr. Linear Transformer parameters are updated with initial 0.01 lr. For ZINC, we train for 2000 epochs with lr halved every 800 epochs. For QM9, we train for 1000 epochs with lr halved ever 400 epochs. The means and standard deviations in Table 2 are each computed over 4 independent seeded runs. Published as a conference paper at ICLR 2025 15 EXPERIMENTS FOR EFFICIENT IMPLEMENTATION 1 3 5 7 9 # Layer 1 3 5 7 9 # Layer 1 3 5 7 9 # Layer 1 3 5 7 9 # Layer 3(a) loss L (electric) 3(b) loss L (resistive) 3(c) lossexp ( 0.5L) (heat) 3(d) Eigenvector Figure 3: Plot of loss against number of layers for the 4 problems. Figures {3(a), 3(b), 3(c), 3(d)} correspond to Figures {1(a),1(b),1(c),2(d)} respectively. The experiment setup of each corresponding pair of plots are identical, except for the architecture used: all plots in Figure 3 are made using the efficient implementation described in Section 5.