# Pure Transformers are Powerful Graph Learners

Jinwoo Kim¹, Tien Dat Nguyen¹, Seonwoo Min², Sungjun Cho², Moontae Lee²,³, Honglak Lee², Seunghoon Hong¹,²
¹KAIST  ²LG AI Research  ³University of Illinois Chicago
(Work done during an internship at LG AI Research. Equal corresponding authors.)

We show that standard Transformers without graph-specific modifications can lead to promising results in graph learning both in theory and practice. Given a graph, we simply treat all nodes and edges as independent tokens, augment them with token embeddings, and feed them to a Transformer. With an appropriate choice of token embeddings, we prove that this approach is theoretically at least as expressive as an invariant graph network (2-IGN) composed of equivariant linear layers, which is already more expressive than all message-passing Graph Neural Networks (GNN). When trained on a large-scale graph dataset (PCQM4Mv2), our method, coined Tokenized Graph Transformer (Token GT), achieves significantly better results compared to GNN baselines and competitive results compared to Transformer variants with sophisticated graph-specific inductive bias. Our implementation is available at https://github.com/jw9730/tokengt.

## 1 Introduction

In recent years, the Transformer [68] has served as a versatile architecture in a broad class of machine learning problems, such as natural language processing [17, 7], computer vision [18], and reinforcement learning [9], to name a few. This is because the fully-attentional structure of the Transformer is general and powerful enough to take, process, and relate inputs and outputs of arbitrary structures, eliminating the need for data- and task-specific inductive bias to be baked into the network architecture. Combined with large-scale training, it opens up a new chapter for building a versatile model that can solve a wide range of problems involving diverse data modalities and even a mixture of modalities [31, 30, 57].

In the graph learning domain, inspired by these breakthroughs, multiple works have tried combining self-attention with graph neural network (GNN) architectures, where message passing was previously dominant [50]. As global self-attention across nodes cannot reflect the graph structure, however, these methods introduce graph-specific architectural modifications. These include restricting self-attention to local neighborhoods [69, 51, 19], using global self-attention in conjunction with message-passing GNNs [58, 43, 34], and injecting edge information into global self-attention via attention bias [72, 78, 29, 54]. Despite decent performance, such modifications can be a limiting constraint in terms of versatility, especially considering future integration into multi-task and multi-modal general-purpose attentional architectures [31]. In addition, by deviating from pure self-attention, these methods may inherit the issues of message passing such as oversmoothing [40, 8, 52], and become incompatible with useful engineering techniques, e.g., linear attention [65], developed for standard self-attention.

Instead, we explore the opposite direction of applying a standard Transformer directly to graphs. For this, we treat all nodes and edges as independent tokens, augment them with appropriate token-wise embeddings, and feed the tokens as input to the standard Transformer. The model operates identically to Transformers used in language and vision; each node or edge is treated as a token, identical to the words in a sentence or the patches of an image [68, 18].
Perhaps surprisingly, we show that this simple approach yields a powerful graph learner both in theory and practice.

Figure 1: Overview of the Tokenized Graph Transformer (Token GT). We treat all nodes and edges of an input graph as independent tokens, augment them with orthonormal node identifiers and trainable type identifiers ([node], [edge]), and feed them to a standard Transformer encoder. For graph-level prediction, we follow the common practice [17, 18] of using an extra trainable [graph] token.

As a key theoretical result, we prove that with appropriate token-wise embeddings, self-attention over the node and edge tokens can approximate any permutation equivariant linear operator on a graph [47]. Remarkably, we show that a very simple choice of embedding composed of node identifiers and type identifiers is sufficient for accurate approximation. This provides a solid theoretical guarantee that, with the embeddings and enough attention heads, a Transformer is at least as expressive as a second-order invariant graph network (2-IGN) [47, 34], which is already more expressive than all message-passing GNNs [21]. This also immediately grants the model expressive power at least as good as the 2-dimensional Weisfeiler-Lehman (WL) graph isomorphism test [46], which is often sufficient for real-world graph data [83]. We further extend our theoretical result to hypergraphs with order-$k$ hyperedges, showing that a Transformer with order-$k$ generalized token embeddings is at least as expressive as $k$-IGN and, consequently, the $k$-WL test.

We test our model, named Tokenized Graph Transformer (Token GT), mainly on the PCQM4Mv2 large-scale quantum chemical property prediction dataset containing 3.7M molecular graphs [27]. Even though Token GT involves minimal graph-specific architectural modifications, it performs significantly better than all GNN baselines, showing that the advantages of the Transformer architecture combined with large-scale training surpass the benefit of the hard inductive bias of GNNs. Furthermore, Token GT achieves competitive performance compared to Transformer variants with strong graph-specific modifications [78, 29, 54]. Finally, we demonstrate that Token GT, in contrast to these variants, can naturally utilize efficient approximations developed for standard Transformers, using kernel attention [11] that enables linear computation cost without much degradation in performance.

## 2 Tokenized Graph Transformer (Token GT)

In this section, we present the Tokenized Graph Transformer (Token GT), a pure Transformer architecture for graphs with token-wise embeddings composed of node identifiers and type identifiers (Figure 1). Our goal in this section is to provide a practical overview; for the theoretical analysis of the architecture, we guide the readers to Section 3.

Let $G = (V, E)$ be an input graph with $n$ nodes $V = \{v_1, ..., v_n\}$ and $m$ edges $E = \{e_1, ..., e_m\} \subseteq V \times V$, associated with features $X^V \in \mathbb{R}^{n \times C}$ and $X^E \in \mathbb{R}^{m \times C}$, respectively. We treat each node and edge as an independent token (thus $(n + m)$ tokens in total) and construct their features as $X = [X^V; X^E] \in \mathbb{R}^{(n+m) \times C}$.
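As a concrete illustration of this tokenization step, the following minimal sketch (our own, with illustrative names and shapes; not the reference implementation) stacks node and edge features into one token sequence:

```python
import numpy as np

def tokenize_graph(x_node: np.ndarray, x_edge: np.ndarray) -> np.ndarray:
    """Stack node features (n, C) and edge features (m, C) into (n + m, C) tokens."""
    return np.concatenate([x_node, x_edge], axis=0)

# Toy graph with n = 3 nodes, m = 2 edges, and C = 4 features per token.
rng = np.random.default_rng(0)
x_node, x_edge = rng.normal(size=(3, 4)), rng.normal(size=(2, 4))
tokens = tokenize_graph(x_node, x_edge)   # X = [X^V; X^E], shape (5, 4)
```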
A naïve way to process a graph is to directly provide the tokens $X$ as input to a Transformer, but this is inappropriate as the graph connectivity is discarded. To thoroughly represent the graph structure, we augment the tokens $X$ with token-wise embeddings: more specifically, orthonormal node identifiers used for representing the connectivity of the tokens, and trainable type identifiers that encode whether a token is a node or an edge. Despite the simplicity, we show that a Transformer applied on these embeddings is a theoretically powerful graph learner.

**Node Identifiers** The first component of the token-wise embedding is the orthonormal node identifier that we use to represent the connectivity structure given in the input graph. For a given input graph $G = (V, E)$, we first produce $n$ node-wise orthonormal vectors $P \in \mathbb{R}^{n \times d_p}$ that we refer to as node identifiers. Then, we augment the tokens $X$ with the node identifiers as follows:

- For each node $v \in V$, we augment the token $X_v$ as $[X_v, P_v, P_v]$.
- For each edge $(u, v) \in E$, we augment the token $X_{(u,v)}$ as $[X_{(u,v)}, P_u, P_v]$.

Intuitively, a Transformer operating on the augmented tokens can fully recognize the connectivity structure of the graph, since comparing the node identifiers between a pair of tokens reveals their incidence information. For instance, we can tell whether an edge $e = (u, v)$ is connected to a node $k$ through a dot product (attention), since $[P_u, P_v][P_k, P_k]^\top = 1$ if and only if $k \in \{u, v\}$ and $0$ otherwise. This allows the Transformer to identify and exploit the connectivity structure of a graph, for instance by putting more weight on incident pairs when local operations are important.

Notably, as the node identifiers $P$ are only required to be orthonormal, we have a large degree of freedom in implementation choices. We outline two practical methods below as examples; their implementation details can be found in Appendix A.3.1.

- Orthogonal random features (ORFs), e.g., rows of a random orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ obtained with the QR decomposition of a random Gaussian matrix $G \in \mathbb{R}^{n \times n}$ [79, 12].
- Laplacian eigenvectors obtained from the eigendecomposition of the graph Laplacian matrix, i.e., rows of $U$ from $\Delta = I - D^{-1/2} A D^{-1/2} = U^\top \Lambda U$, where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix, $D$ is the degree matrix, and $\Lambda$, $U$ correspond to the eigenvalues and eigenvectors, respectively [20].

Among the two methods, node identifiers generated as ORFs do not encode any information about the graph structure, as they are entirely random. This means the Transformer that operates on ORF-based node identifiers needs to compile and recognize the graph structure only from the incidence information provided by the node identifiers. Although this is challenging, perhaps surprisingly, we empirically show in Section 5 that Transformers are strong enough to learn meaningful structural representations out of ORF-based node identifiers and outperform GNNs on a large-scale task.

In contrast to ORFs, Laplacian eigenvectors provide a kind of graph positional embedding (graph PE) that describes the distance between nodes on a graph. Due to the positional information, they yield better performance compared to ORFs in our experiments in Section 5. One interesting aspect of Laplacian eigenvectors is that they can be viewed as a generalization of the sinusoidal positional embeddings of NLP Transformers to graphs, as the eigenvectors of 1D chain graphs are sine and cosine functions [20]. Thus, by choosing Laplacian eigenvectors as node identifiers, our approach can be interpreted as a direct extension of the NLP Transformer to inputs involving relational structures.
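To make the two options concrete, here is a minimal sketch (our own illustration under the definitions above; function and variable names are ours, not the reference implementation) of generating node identifiers and augmenting the tokens with them:

```python
import numpy as np

def orf_node_identifiers(n: int, rng: np.random.Generator) -> np.ndarray:
    """Orthogonal random features: rows of Q from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q  # (n, n); rows are orthonormal

def laplacian_node_identifiers(adj: np.ndarray) -> np.ndarray:
    """Eigenvectors of the normalized Laplacian I - D^{-1/2} A D^{-1/2}; row i identifies node i."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    _, eigvecs = np.linalg.eigh(lap)
    return eigvecs  # (n, n); keeping all n eigenvectors preserves row orthonormality

def augment_with_node_identifiers(x_node, x_edge, edges, p):
    """Node tokens become [X_v, P_v, P_v]; edge tokens become [X_(u,v), P_u, P_v]."""
    node_tokens = np.concatenate([x_node, p, p], axis=1)
    edge_tokens = np.concatenate([x_edge, p[edges[:, 0]], p[edges[:, 1]]], axis=1)
    return np.concatenate([node_tokens, edge_tokens], axis=0)  # (n + m, C + 2*d_p)
```

In practice one may keep only a truncated set of eigenvectors as a lower-dimensional identifier (we phrase this as an assumption; the exact choice of $d_p$ is described in Appendix A.3.1), trading exact row orthonormality for compactness.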
**Type Identifiers** The second component of the token-wise embedding is the trainable type identifier that encodes whether a token is a node or an edge. For a given input graph $G = (V, E)$, we first prepare a trainable parameter matrix $E = [E^V; E^E] \in \mathbb{R}^{2 \times d_e}$ that contains two type identifiers $E^V$ and $E^E$ for nodes and edges, respectively. Then, we further augment the tokens with the type identifiers as follows:

- For each node $v \in V$, we augment the token $[X_v, P_v, P_v]$ as $[X_v, P_v, P_v, E^V]$.
- For each edge $(u, v) \in E$, we augment the token $[X_{(u,v)}, P_u, P_v]$ as $[X_{(u,v)}, P_u, P_v, E^E]$.

These embeddings provide information on whether a given token is a node or an edge, which is critical, e.g., when an attention head tries to attend specifically to node tokens and ignore edge tokens.

**Main Transformer** With node identifiers and type identifiers, we obtain augmented token features $X^{\text{in}} \in \mathbb{R}^{(n+m) \times (C + 2d_p + d_e)}$, which are further projected by a trainable matrix $w^{\text{in}} \in \mathbb{R}^{(C + 2d_p + d_e) \times d}$ to become the input to the Transformer. For graph-level prediction, we prepend a special token [graph] with a trainable embedding $X_{[\text{graph}]} \in \mathbb{R}^{d}$, similar to BERT [17] and ViT [18]. We utilize the feature of the [graph] token at the output of the encoder as the graph representation, on which a linear prediction head is applied to produce the final graph-level prediction. Overall, the tokens $Z^{(0)} = [X_{[\text{graph}]}; X^{\text{in}} w^{\text{in}}] \in \mathbb{R}^{(1+n+m) \times d}$ are used as the input to the main encoder. As the encoder, we adopt the standard Transformer [68], which is an alternating stack of multihead self-attention layers (MSA) and feedforward MLP layers. We provide further details in Appendix A.1.1.

**Inductive Bias** Similar to Transformers in language and vision [17, 18], the Tokenized Graph Transformer treats input nodes and edges as independent tokens and applies self-attention to them. This approach carries much less inductive bias than current GNNs, where the sparse graph structure, or more fundamentally, the permutation symmetry of graphs, is deliberately baked into each layer [21, 47, 46, 34]. For Token GT, such information is provided entirely as a part of the input using token-wise embeddings, and the model has to learn how to interpret and utilize the information from data. Although such weak inductive bias might raise questions on the expressiveness of the model, our theoretical analysis in Section 3 shows that Token GT is a powerful graph learner thanks to the token-wise embeddings and the expressive power of self-attention. For example, we show that Token GT is more expressive than all message-passing GNNs under the framework of Gilmer et al. (2017) [21].

## 3 Theoretical Analysis

We now present our theory. Our key result is that Token GT, a standard Transformer with node and type identifiers presented in Section 2, is provably at least as expressive as the second-order Invariant Graph Network (2-IGN [47]), which is built upon all possible permutation equivariant linear layers on a graph. This provides solid theoretical guarantees for Token GT, such as being at least as powerful as the 2-WL graph isomorphism test and more expressive than all message-passing GNNs. Our theory is based on a general framework on hypergraphs represented as higher-order tensors, which leads to the formulation of order-$k$ Token GT that is at least as expressive as the order-$k$ IGN ($k$-IGN [47]).

### 3.1 Preliminary: Permutation Symmetry and Invariant Graph Networks

**Representing and Processing Sets and (Hyper)Graphs** For a set of $n$ nodes, we often represent their features as $X \in \mathbb{R}^{n \times d}$, where $X_i \in \mathbb{R}^{d}$ is the feature of the $i$-th node.
The set is unordered and, therefore, should be treated as invariant to the renumbering of the nodes. Let $S_n$ be the symmetric group, i.e., the group of permutations $\pi$ on $[n] = \{1, ..., n\}$. By $\pi \cdot X$ we denote permuting the rows of $X$ with $\pi$, i.e., $(\pi \cdot X)_i = X_{\pi^{-1}(i)}$. Here, $X$ and $\pi \cdot X$ represent the identical set for all $\pi \in S_n$. Generally, we consider (hyper)graphs represented as an order-$k$ tensor $X \in \mathbb{R}^{n^k \times d}$ with a feature $X_{\mathbf{i}} = X_{i_1, ..., i_k} \in \mathbb{R}^{d}$ attached to the (hyper)edge represented by the multi-index $\mathbf{i} = (i_1, ..., i_k) \in [n]^k$. Similar to sets, the tensor should be treated as invariant to node renumbering by any $\pi \in S_n$ that acts on $X$ by $(\pi \cdot X)_{\mathbf{i}} = X_{\pi^{-1}(\mathbf{i})}$, where $\pi^{-1}(\mathbf{i}) = (\pi^{-1}(i_1), ..., \pi^{-1}(i_k))$. That is, $X$ and $\pi \cdot X$ represent the identical (hyper)graph for all $\pi$. Due to such symmetry, to build a function $F$ from a tensor $X$ to a target $T$, a suitable way is to make it invariant, $F(\pi \cdot X) = F(X)$, when the target is a vector, or equivariant, $F(\pi \cdot X) = \pi \cdot F(X)$, when the target is also a tensor, for all $X \in \mathbb{R}^{n^k \times d}$ and $\pi \in S_n$.

In our theoretical analysis, we work on the order-$k$ dense tensor representation $X \in \mathbb{R}^{n^k \times d}$ of a graph, as it can represent node features ($k = 1$), edge features ($k = 2$), or hyperedge features ($k > 2$) in a unified manner. This is interchangeable with, but slightly different from, the sparse representation of a graph with edge set $E$ used in Section 2. Nevertheless, in Section 5 we empirically verify that our key theoretical findings work equally well for dense and sparse graphs.

**Invariant Graph Network** We mainly develop our theoretical analysis upon Invariant Graph Networks (IGNs) [47, 46], a family of expressive graph networks derived from the permutation symmetry of the tensor representation of graphs. Here we provide a summary. In general, we define:

**Definition 1.** An order-$k$ Invariant Graph Network ($k$-IGN) is a function $F_k : \mathbb{R}^{n^k \times d_0} \to \mathbb{R}$ written as follows:
$$F_k = \text{MLP} \circ L_{k \to 0} \circ L^{(T)}_{k \to k} \circ \sigma \circ \cdots \circ \sigma \circ L^{(1)}_{k \to k}, \tag{1}$$
where each $L^{(t)}_{k \to k}$ is an equivariant linear layer [47] from $\mathbb{R}^{n^k \times d_{t-1}}$ to $\mathbb{R}^{n^k \times d_t}$, $\sigma$ is an activation function, and $L_{k \to 0}$ is an invariant linear layer from $\mathbb{R}^{n^k \times d_T}$ to $\mathbb{R}$.

A body of previous work has shown appealing theoretical properties of $k$-IGN, including universal approximation [48] and alignment with the $k$-Weisfeiler-Lehman ($k$-WL) graph isomorphism test [46, 10]. In particular, it is known that $k$-IGNs are theoretically at least as powerful as the $k$-WL test [46]. It is also known that 2-IGNs are already more expressive [47, 34] than all message-passing GNNs under the framework of Gilmer et al. (2017) [21].

The core building block of IGN is the invariant and equivariant linear layers [47], which are maximally expressive while respecting node permutation symmetry. The layers are defined as follows:

**Definition 2.** An equivariant linear layer is a function $L_{k \to l} : \mathbb{R}^{n^k \times d} \to \mathbb{R}^{n^l \times d'}$ written as follows for an order-$k$ input $X \in \mathbb{R}^{n^k \times d}$:
$$L_{k \to l}(X)_{\mathbf{i}} = \sum_{\mu} \sum_{\mathbf{j}} B^{\mu}_{\mathbf{i}, \mathbf{j}} X_{\mathbf{j}} w_{\mu} + \sum_{\lambda} C^{\lambda}_{\mathbf{i}} b_{\lambda}, \tag{2}$$
where $\mathbf{i} \in [n]^l$ and $\mathbf{j} \in [n]^k$ are multi-indices, $w_{\mu} \in \mathbb{R}^{d \times d'}$ and $b_{\lambda} \in \mathbb{R}^{d'}$ are weight and bias parameters, and $B^{\mu} \in \mathbb{R}^{n^{l+k}}$ and $C^{\lambda} \in \mathbb{R}^{n^{l}}$ are binary basis tensors corresponding to order-$(l+k)$ and order-$l$ equivalence classes $\mu$ and $\lambda$, respectively. An invariant linear layer is a special case of $L_{k \to l}$ with $l = 0$.

We provide the definition of the equivalence classes and basis tensors in Appendix A.1.1. For now, it is sufficient to know that the basis tensors are binary tensors that form the orthogonal basis of the full space of linear equivariant layers. In general, it is known that in Eq. (2) there exist $\text{bell}(k+l)$ basis tensors $B^{\mu}$ for the weight and $\text{bell}(l)$ basis tensors $C^{\lambda}$ for the bias.
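As a concrete illustration of Definition 2 (our own sketch, not from the paper), the simplest case $k = l = 1$ has $\text{bell}(2) = 2$ weight basis tensors ($I$ and $\mathbf{1}\mathbf{1}^\top - I$) and $\text{bell}(1) = 1$ bias basis tensor ($\mathbf{1}$); the snippet below implements this layer and numerically checks its permutation equivariance. The same first-order example is revisited analytically in Section 3.2.

```python
import numpy as np

def equivariant_linear_1_1(X, w1, w2, b):
    """L_{1->1}(X) = I X w1 + (11^T - I) X w2 + 1 b^T for a set X of shape (n, d)."""
    n = X.shape[0]
    eye, ones = np.eye(n), np.ones((n, n))
    return eye @ X @ w1 + (ones - eye) @ X @ w2 + np.outer(np.ones(n), b)

rng = np.random.default_rng(0)
n, d, d_out = 5, 3, 4
X = rng.normal(size=(n, d))
w1, w2 = rng.normal(size=(d, d_out)), rng.normal(size=(d, d_out))
b = rng.normal(size=d_out)

# Permutation equivariance: permuting the input rows permutes the output rows.
perm = rng.permutation(n)
assert np.allclose(equivariant_linear_1_1(X, w1, w2, b)[perm],
                   equivariant_linear_1_1(X[perm], w1, w2, b))
```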
### 3.2 Can Self-Attention Approximate Equivariant Basis?

Now, we present an intuition that connects the Transformer (Section 2) and the equivariant linear layer (Definition 2). For that, we write out the multihead self-attention layer as follows:
$$\text{MSA}(X)_i = \sum_{h=1}^{H} \sum_{j} \alpha^h_{ij} X_j w^V_h w^O_h, \quad \text{where} \quad \alpha^h = \text{softmax}\!\left(\frac{X w^Q_h (X w^K_h)^\top}{\sqrt{d_H}}\right), \tag{3}$$
where $H$ is the number of heads, $d_H$ is the head size, and $w^Q_h, w^K_h \in \mathbb{R}^{d \times d_H}$, $w^V_h \in \mathbb{R}^{d \times d_v}$, $w^O_h \in \mathbb{R}^{d_v \times d}$.

Our intuition is that the weighted sum of values with the self-attention matrix $\alpha^h$ in Eq. (3) is analogous to the masked sum with a basis tensor $B^{\mu}$ in Eq. (2), up to normalization. This naturally leads to the following question: for a given equivariant layer $L_{k \to k} : \mathbb{R}^{n^k \times d} \to \mathbb{R}^{n^k \times d}$, can we use a Transformer layer with multihead self-attention $\text{MSA} : \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d}$ with $N = n^k$ to accurately approximate $L_{k \to k}$ by having $H = \text{bell}(2k)$ attention heads approximate each equivariant basis $B^{\mu}$? We show that this is possible, but only if we provide appropriate auxiliary information to the input.

For example, let us consider the first-order layer $L_{1 \to 1}$. The layer has $\text{bell}(2) = 2$ basis tensors $B^{\mu_1} = I$ and $B^{\mu_2} = \mathbf{1}\mathbf{1}^\top - I$ for the weight, and $\text{bell}(1) = 1$ basis tensor $C^{\lambda_1} = \mathbf{1}$ for the bias. Given an input set $X \in \mathbb{R}^{n \times d}$, it computes the following with $w_1, w_2 \in \mathbb{R}^{d \times d'}$ and $b \in \mathbb{R}^{d'}$:
$$L_{1 \to 1}(X) = I X w_1 + (\mathbf{1}\mathbf{1}^\top - I) X w_2 + \mathbf{1} b^\top. \tag{4}$$

Now consider approximating the basis tensor $B^{\mu_1} = I$ with an attention matrix $\alpha^1$. The approximation is accurate when the $i$-th query always attends only to the $i$-th key and ignores the rest. To achieve this attention structure consistently, i.e., agnostic to the input $X$, we need to provide auxiliary input that self-attention can "latch onto" to faithfully approximate $\alpha^1 \approx I$. Without this, attention must rely entirely on the input $X$, which is unreliable and can lead to approximation failure, e.g., when $X$ has repeated rows.

For the auxiliary information, we prepare $n$ node-wise orthonormal vectors $P \in \mathbb{R}^{n \times d_p}$ (note that this is identical to the node identifiers in Section 2), and augment the input to $X^{\text{in}} = [X, P] \in \mathbb{R}^{n \times (d + d_p)}$. Let us assume that the query and key projections in Eq. (3) ignore $X$ and only leave $P$ scaled by $a$ with $a > 0$. Then the attention matrix is computed as $\alpha^1 = \text{softmax}(S)$ where $S_{ij} = a P_i^\top P_j$. Here, due to the orthonormality of $P$, we have $P_i^\top P_j = 1$ if $i = j$ and $0$ otherwise, which leads to $S = aI$. With $a \to \infty$ by scaling up the query and key projection weights, the softmax becomes arbitrarily close to the hardmax operator, and we obtain the following:
$$\alpha^1 = \text{softmax}(aI) \to I \quad \text{as } a \to \infty. \tag{5}$$
Thus, self-attention can utilize the auxiliary information $P$ to achieve an input-agnostic approximation of $\alpha^1$ to $I$. Notably, we can achieve a similar approximation for $B^{\mu_2} = \mathbf{1}\mathbf{1}^\top - I$ using the same $P$ by flipping the sign of the keys, which gives $\alpha^2 = \text{softmax}(-aI)$ due to orthonormality. By sending $a \to \infty$, attention from the $i$-th query to the $i$-th key is now suppressed, and we obtain the following:
$$\alpha^2 = \text{softmax}(-aI) \to \tfrac{1}{n-1}(\mathbf{1}\mathbf{1}^\top - I) \quad \text{as } a \to \infty. \tag{6}$$
Note that this approximation is accurate only up to row normalization, as the rows of $\alpha^2$ always sum to one due to the softmax, while $B^{\mu_2} = \mathbf{1}\mathbf{1}^\top - I$ is binary. In our proofs of the theoretical results, we perform appropriate denormalization with the MLP after MSA to achieve an accurate approximation. Overall, we see that the simple auxiliary input $P$ suffices for two attention heads to approximate the equivariant basis of $L_{1 \to 1}$ accurately.
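A minimal numerical check of Eqs. (5) and (6) (our own illustration, not code from the paper) confirms that with orthonormal node identifiers and a large scale $a$, the two attention patterns indeed approach $I$ and $\tfrac{1}{n-1}(\mathbf{1}\mathbf{1}^\top - I)$:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

n, a = 6, 50.0
P, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(n, n)))  # orthonormal node identifiers
S = P @ P.T                                                        # = I up to numerical error

alpha1 = softmax(a * S)    # Eq. (5): -> I
alpha2 = softmax(-a * S)   # Eq. (6): -> (11^T - I) / (n - 1)

assert np.allclose(alpha1, np.eye(n), atol=1e-6)
assert np.allclose(alpha2, (np.ones((n, n)) - np.eye(n)) / (n - 1), atol=1e-6)
```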
We now turn to the general case: given appropriate auxiliary information as input, can a Transformer layer with $\text{bell}(2k)$ attention heads accurately approximate $L_{k \to k}$ by having each head approximate one equivariant basis $B^{\mu}$? And what would be a sufficient auxiliary input? We answer these questions by showing that, with (order-$k$ generalized) node and type identifiers presented in Section 2, Transformer layers can accurately approximate equivariant layers $L_{k \to k}$ via input-agnostic head-wise approximation of each equivariant basis.

### 3.3 Pure Transformers are Powerful Graph Learners

We now present our main theoretical results, which extend the discussion in Section 3.2 to any order $k$. Note that $k = 2$ corresponds to Token GT for graphs presented in Section 2. With $k > 2$, we naturally extend Token GT to hypergraphs. All proofs can be found in Appendix A.1.

We first introduce generalized node and type identifiers (Section 2) for order-$k$ tensors $X \in \mathbb{R}^{n^k \times d}$. We define the node identifier $P \in \mathbb{R}^{n \times d_p}$ as an orthonormal matrix with $n$ rows, and the type identifier as a trainable matrix $E \in \mathbb{R}^{\text{bell}(k) \times d_e}$ that contains $\text{bell}(k)$ rows $E^{\gamma_1}, ..., E^{\gamma_{\text{bell}(k)}}$, each of which is designated for an order-$k$ equivalence class $\gamma$. Then, we augment each entry of the input tensor as $[X_{i_1, ..., i_k}, P_{i_1}, ..., P_{i_k}, E^{\gamma}]$ where $(i_1, ..., i_k) \in \gamma$. Let us exemplify. For $k = 1$ (sets), each $i$-th entry is augmented as $[X_i, P_i, E^{\gamma_1}]$, consistent with our discussion in Section 3.2. For $k = 2$ (graphs), each $(i, i)$-th entry is augmented as $[X_{ii}, P_i, P_i, E^{\gamma_1}]$ and each $(i, j)$-th entry ($i \neq j$) is augmented as $[X_{ij}, P_i, P_j, E^{\gamma_2}]$. This is consistent with Token GT in Section 2, which augments nodes with $E^V = E^{\gamma_1}$ and edges with $E^E = E^{\gamma_2}$.

With the node and type identifiers, we obtain an augmented order-$k$ tensor $X^{\text{in}} \in \mathbb{R}^{n^k \times (d + kd_p + d_e)}$. We use a trainable projection $w^{\text{in}} \in \mathbb{R}^{(d + kd_p + d_e) \times d_T}$ to map it to the hidden dimension $d_T$ of a Transformer. We now show that self-attention on $X^{\text{in}} w^{\text{in}}$ can accurately approximate the equivariant basis:

**Lemma 1.** For all $X \in \mathbb{R}^{n^k \times d}$ and their augmentation $X^{\text{in}}$, the self-attention coefficients $\alpha^h$ (Eq. (3)) computed with $X^{\text{in}} w^{\text{in}}$ can approximate any basis tensor $B^{\mu} \in \mathbb{R}^{n^{2k}}$ of the order-$k$ equivariant linear layer $L_{k \to k}$ (Definition 2) to arbitrary precision up to normalization.

Consequently, with the node and type identifiers, a collection of $\text{bell}(2k)$ attention heads can approximate the collection of all basis tensors of an order-$k$ equivariant layer. This leads to the following:

**Theorem 1.** For all $X \in \mathbb{R}^{n^k \times d}$ and their augmentation $X^{\text{in}}$, a Transformer layer with $\text{bell}(2k)$ self-attention heads that operates on $X^{\text{in}} w^{\text{in}}$ can approximate an order-$k$ equivariant linear layer $L_{k \to k}(X)$ (Definition 2) to arbitrary precision.

While the approximation in Lemma 1 is only accurate up to normalization over the inputs (keys) due to the softmax normalization, for the approximation in Theorem 1 we perform appropriate denormalization using the MLP after multihead self-attention and can obtain an accurate approximation. By extending the result to multiple layers, we arrive at the following:

**Theorem 2.** For all $X \in \mathbb{R}^{n^k \times d}$ and their augmentation $X^{\text{in}}$, a Transformer composed of $T$ layers that operates on $X^{\text{in}} w^{\text{in}}$, followed by sum-pooling and an MLP, can approximate a $k$-IGN $F_k(X)$ (Definition 1) to arbitrary precision.

This directly leads to the following corollary:

**Corollary 1.** A Transformer on node and type identifiers in Theorem 2 is at least as expressive as a $k$-IGN composed of order-$k$ equivariant linear layers.

Corollary 1 allows us to draw on previous theoretical results on the expressiveness of $k$-IGN [46, 47, 34] and use them to lower-bound the provable expressiveness of a standard Transformer:

**Corollary 2.**
A Transformer on node and type identifiers in Theorem 2 is at least as powerful as the $k$-WL graph isomorphism test and is more expressive than all message-passing GNNs within the framework of Gilmer et al. (2017) [21].

## 4 Related Work

We outline relevant work, including equivariant neural networks, theory on the expressive power of Transformers and its connection to modeling equivariance, and Transformers for graphs.

**Equivariant Neural Networks** A machine learning task is often invariant or equivariant to a specific symmetry of the input data, e.g., image classification is invariant to the translation of an input image. A large body of literature advocates baking the invariance or equivariance into a neural network as a type of inductive bias (e.g., translation equivariance of image convolution), showing that it reduces the number of parameters and improves generalization for a wide range of learning tasks involving various geometric structures [13, 14, 73, 66, 49, 53, 60, 6, 34, 39]. Ravanbakhsh et al. (2017) [56] showed that any equivariant layer for discrete group actions is equivalent to a specific parameter sharing structure. Zaheer et al. (2017) [82] and Maron et al. (2019) [47] derived the parameter sharing for node permutation-symmetric data (sets and (hyper)graphs), which gives the maximally expressive equivariant linear layers and the $k$-IGN in Section 3.1. The work on equivariant neural networks underlies our theory of how a standard Transformer can be a powerful learner for sets and (hyper)graphs.

**Expressive Power of Transformers and Its Connection to Equivariance** Recent work involving Transformers often focuses on minimizing domain- and task-specific inductive bias and scaling the model and data so that any useful computation structure can be learned [18, 31, 30, 7, 17, 9, 39]. The success of this approach is, to some degree, attributed to the high expressive power of Transformers that allows learning diverse functions suited for the data at hand [81, 39, 3, 4, 41]. Recent theory has shown that Transformers are expressive enough to even model certain equivariant functions [1, 15, 39]. Andreoli (2019) [1] cast self-attention and convolution into a unified framework using basis tensors similar to the ones in Section 3.1. Cordonnier et al. (2020) [15] advanced the idea and showed that Transformers with relative positional encodings can approximate any image convolution layer. Lee et al. (2019) [39] and Kim et al. (2021) [34] showed that Transformers can model equivariant linear layers for sets [82], which can be viewed as the first-order case of our theory (see Section 3.2). To our knowledge, our work is the first to show that standard Transformers are expressive enough to provably model maximally expressive equivariant layers and $k$-IGN for (hyper)graphs with $k \geq 2$.

**Transformers for Graphs** Unlike in language and vision, developing Transformers for graphs is challenging due to (1) the presence of edge connectivity and (2) the absence of a canonical node ordering, which prevents adopting simple positional encodings [50]. To incorporate the connectivity of edges, early methods restricted self-attention to local neighborhoods (thus reducing to message passing) [19, 51, 69] or used global self-attention with auxiliary message-passing modules [58, 43]. As message passing suffers from limited expressive power [77] and oversmoothing [40, 8, 52], recent works often discard it and use global self-attention on nodes with heuristic modifications to process edges [78, 29, 54, 38, 42]. Ying et al.
(2021) [78] proposed to inject edge encodings based on shortest paths through a self-attention bias. Kreuzer et al. (2021) [38] proposed to incorporate edges into the self-attention matrix via element-wise multiplication. In contrast, we leave self-attention unmodified and provide both nodes and edges, with certain token-wise embeddings (Section 2), as its input. To incorporate graph structure into nodes, on the other hand, some approaches focus on developing graph positional encodings, e.g., based on Laplacian eigenvectors [20, 42, 38]. While these can be directly incorporated into our work via auxiliary node identifiers for better performance, we leave this as future work. We further note that current graph Transformers that utilize Laplacian positional encoding rely heavily on heuristic edge encoding [29, 38], while ours does not. Another closely related approach is the Higher-order Transformer [34], which generalizes $k$-IGN with masked self-attention. While it is highly complex to implement due to hard-coded head-wise equivariant masks, our method can be implemented effortlessly using any available implementation of a standard Transformer. Furthermore, our method is more flexible, as the model can choose to use different attention heads to focus on a specific equivariant operator (e.g., local propagation) if needed. We further discuss the difficulty of applying linear attention to graph Transformers in Appendix A.2.

## 5 Experiments

We first conduct a synthetic experiment that directly confirms our key claims in Lemma 1 (Section 3). Then, we empirically explore the capability of the Tokenized Graph Transformer (Token GT) (Section 2) using the PCQM4Mv2 large-scale quantum chemistry regression dataset [27]. We further present experiments on transductive node classification datasets involving large graphs in Appendix A.4.3.

Table 1: Second-order equivariant basis approximation. We report the average and standard deviation (over 3 runs) of the L2 error averaged over heads. For Random/ORF (first-order), we sample random embeddings independently for each token.

| node id. | type id. | dense train L2 | dense test L2 | sparse train L2 | sparse test L2 |
|---|---|---|---|---|---|
| – | – | 47.95 ± 0.600 | 53.93 ± 1.426 | 29.88 ± 0.450 | 34.70 ± 1.167 |
| – | ✓ | 32.38 ± 0.448 | 40.06 ± 1.202 | 15.92 ± 0.275 | 20.39 ± 0.765 |
| Random (first-order) | ✓ | 32.19 ± 0.476 | 32.49 ± 3.687 | 15.87 ± 0.247 | 16.56 ± 0.904 |
| ORF (first-order) | ✓ | 32.35 ± 0.369 | 39.87 ± 1.263 | 15.87 ± 0.247 | 16.56 ± 0.908 |
| Random | – | 5.909 ± 0.019 | 5.548 ± 0.090 | 8.152 ± 0.042 | 8.270 ± 0.285 |
| ORF | – | 5.472 ± 0.035 | 5.143 ± 0.078 | 7.167 ± 0.025 | 7.190 ± 0.217 |
| Laplacian eigenvector | – | 1.899 ± 3.050 | 1.702 ± 2.912 | 0.288 ± 0.019 | 0.064 ± 0.010 |
| Random | ✓ | 0.375 ± 0.009 | 0.234 ± 0.011 | 0.990 ± 0.108 | 0.875 ± 0.042 |
| ORF | ✓ | 0.080 ± 0.001 | 0.009 ± 5e-5 | 0.129 ± 0.002 | 0.011 ± 0.002 |
| Laplacian eigenvector | ✓ | 0.053 ± 1.5e-5 | 0.005 ± 1e-4 | 0.101 ± 0.003 | 0.019 ± 0.007 |

Figure 2: Self-attention maps learned under various node and type identifier configurations (no embedding, type id. only, node id. only, and node id. + type id.) for two target equivariant basis tensors (out of 15); each panel compares the learned self-attention matrix $\alpha^h$ against the target basis tensor $B^{\mu}$. For better visualization, we clamp the entries by 0.01. Self-attention learns acute patterns coherent with the equivariant basis when orthonormal node identifiers and type identifiers are both provided as input. More images can be found in Appendix A.4.1.

### 5.1 Approximating Second-Order Equivariant Basis

As in Theorems 1 and 2 (Section 3), our argument on the expressive power of Token GT relies on its capability to approximate order-$k$ permutation equivariant linear layers $L_{k \to k}$ (Definition 2).
Specifically, Lemma 1 states that such capability depends on the ability of each self-attention head $\alpha^1, ..., \alpha^H$ (Eq. (3)) to accurately approximate each equivariant basis tensor $B^{\mu_1}, ..., B^{\mu_{\text{bell}(2k)}}$ (Definition 2) up to normalization. We verify this claim for $k = 2$ (second-order; graphs) in a synthetic setup using Barabási-Albert random graphs. We use a multihead self-attention layer (Eq. (3)) with $\text{bell}(2+2) = 15$ heads and explicitly supervise the head-wise attention scores $\alpha^h$ to approximate each (normalized) equivariant basis tensor $B^{\mu_h}$ by minimizing an L2 loss. With the layer hyperparameters fixed, we provide different combinations of node and type identifiers and test whether multihead self-attention can jointly approximate all 15 equivariant basis tensors on unseen graphs. We experiment with both dense and sparse graph representations; for graphs with $n$ nodes and $m$ edges, the dense representation considers all $n^2$ pairwise edges as input as in Section 3, whereas the sparse representation considers only the $m$ present edges as in Section 2. Further details can be found in Appendix A.3.2.

We outline the results in Table 1. Consistent with Lemma 1, self-attention achieves an accurate approximation of the equivariant basis only when both the orthonormal node identifiers and the type identifiers are given. Here, Laplacian eigenvectors (Lap) often yield slightly better results than orthogonal random features (ORF), presumably due to less stochasticity. Interestingly, we see that self-attention transfers the learned (pseudo-)equivariant self-attention structure to unseen graphs near perfectly. Non-orthogonal random embeddings lead to an inaccurate approximation (Random), highlighting the importance of the orthogonality of node identifiers. The approximation is also inaccurate when we sample an ORF $P_t$ independently for each token $t$ (ORF (first-order)) instead of using concatenated node identifiers $[P_u, P_v]$ for token $(u, v)$. This supports our argument in Section 2 that the incidence information implicitly provided via node identifiers plays a key role in approximation. In Figure 2, we provide a visualization of self-attention maps learned under various node and type identifier choices. Additional results can be found in Appendix A.4.1.
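For reference, the supervision targets in this experiment are the $\text{bell}(4) = 15$ basis tensors of $L_{2 \to 2}$; the sketch below (our own illustration, not the experiment code) enumerates them by grouping 4-index tuples into equality patterns:

```python
import itertools
import numpy as np

def pattern(idx):
    """Canonical equality pattern of a multi-index, e.g. (3, 3, 5, 3) -> (0, 0, 1, 0)."""
    first_seen = {}
    return tuple(first_seen.setdefault(v, len(first_seen)) for v in idx)

def order2_equivariant_basis(n):
    """Binary basis tensors B^mu of shape (n, n, n, n); requires n >= 4 to realize all 15 classes."""
    patterns = sorted({pattern(idx) for idx in itertools.product(range(n), repeat=4)})
    assert len(patterns) == 15  # bell(4)
    basis = np.zeros((len(patterns), n, n, n, n))
    for idx in itertools.product(range(n), repeat=4):
        basis[patterns.index(pattern(idx))][idx] = 1.0
    return basis  # each head's target is a (row-normalized) B^mu
```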
### 5.2 Large-Scale Graph Learning

An exclusive characteristic of Token GT is its minimal graph-specific inductive bias, which requires it to learn its internal computation structure largely from data. As such models are commonly known to work well with large-scale data [68, 18], we explore the capability of Token GT on the PCQM4Mv2 quantum chemistry regression dataset [27], one of the current largest with 3.7M molecular graphs. For Token GT, we use both node and type identifiers, and use a main Transformer encoder configuration based on Graphormer [78] with 12 layers, 768 hidden dimensions, and 32 attention heads. We try both ORFs and Laplacian eigenvectors as node identifiers, and denote the corresponding models as Token GT (ORF) and Token GT (Lap), respectively. As an ablation, we also experiment with the same Transformer without node and type identifiers, which we denote as Transformer. Finally, we apply the kernel attention [11] that approximates the attention computation to linear cost (Token GT (Lap) + Performer). We use the AdamW optimizer with $(\beta_1, \beta_2) = (0.99, 0.999)$ and weight decay 0.1, and 60k learning rate warmup steps followed by linear decay over 1M iterations with batch size 1024. For fine-tuning, we use 1k warmup steps, 0.1M training steps, and cosine learning rate decay. We train the models on 8 RTX 3090 GPUs for 3 days. Further details are in Appendix A.3.3.

Table 2: Results on the PCQM4Mv2 large-scale graph regression benchmark. We report the Mean Absolute Error (MAE) on the validation set, and MAE on the hidden test-dev set where available.

| method | # parameters | valid MAE | test-dev MAE | asymptotics |
|---|---|---|---|---|
| **Message-passing GNNs** | | | | |
| GCN [27] | 2.0M | 0.1379 | 0.1398 | O(n + m) |
| GIN [27] | 3.8M | 0.1195 | 0.1218 | O(n + m) |
| GAT | 6.7M | 0.1302 | N/A | O(n + m) |
| GCN-VN [27] | 4.9M | 0.1153 | 0.1152 | O(n + m) |
| GIN-VN [27] | 6.7M | 0.1083 | 0.1084 | O(n + m) |
| GAT-VN | 6.7M | 0.1192 | N/A | O(n + m) |
| GAT-VN (large) | 55.2M | 0.1361 | N/A | O(n + m) |
| **Transformers with strong graph-specific modifications** | | | | |
| Graphormer [63] | 48.3M | 0.0864 | N/A | O(n²) |
| EGT [29] | 89.3M | 0.0869 | 0.0872 | O(n²) |
| GRPE [54] | 46.2M | 0.0890 | 0.0898 | O(n²) |
| **Pure Transformers** | | | | |
| Transformer | 48.5M | 0.2340 | N/A | O((n + m)²) |
| Token GT (ORF) | 48.6M | 0.0962 | N/A | O((n + m)²) |
| Token GT (Lap) | 48.5M | 0.0910 | 0.0919 | O((n + m)²) |
| Token GT (Lap) + Performer | 48.5M | 0.0935 | N/A | O(n + m) |

We provide the results in Table 2. A standard Transformer on the node and edge tokens cannot recognize graph structure and shows low performance (0.2340 valid MAE). Yet, the picture changes as soon as we augment the tokens with node and type identifiers. Notably, Token GT (ORF) achieves 0.0962 MAE, which is already better than all GNN baselines. This is a somewhat surprising result, as neither the ORFs nor the Transformer is aware of graph structure. This implies the Transformer is strong enough to learn to interpret and reason over the incidence structure of tokens provided only implicitly by the node and type identifiers. By further switching to Laplacian eigenvectors that encode position on graphs [20], we observe a performance boost to 0.0910 MAE, competitive with Transformers with sophisticated graph-specific modifications (e.g., shortest path-based spatial encoding [78]). While such methods inject graph structure into the attention matrix via a bias term and therefore strictly require $O(n^2)$ cost, Token GT enables adopting kernelization for pure self-attention [11], resulting in Token GT (Lap) + Performer with the best performance among $O(n+m)$ models (0.0935 MAE). Further discussion on the empirical performance of Token GT can be found in Appendix A.5.

While our theory in Section 3 guarantees that Token GT can reduce to an equivariant layer by learning a fixed equivariant basis at each attention head, in practice, it can freely utilize multihead self-attention to learn a less restricted and more useful computation structure from data. To analyze such structure, we compute the attention distance across heads and network depth by averaging pairwise token distances on a graph weighted by their attention scores (Figure 3). This distance is analogous to the number of hops in message-passing.

Figure 3: Attention distance by head and network depth for Token GT (ORF) and Token GT (Lap) (x-axis: network depth in layers; y-axis: mean attention distance in hops). Each dot shows the mean attention distance in hops across graphs of one head at one layer. The visualization is inspired by Dosovitskiy et al. (2020) [18]. More images can be found in Appendix A.4.2.

In both Token GT (ORF) and Token GT (Lap), in the lowest layers, some heads attend globally over the graph while others consistently have small receptive fields (acting like a local message-passing operator). In deeper layers, the attention distances increase, and most heads attend globally. Interestingly, this behavior is highly consistent with Vision Transformers on image patches [18], suggesting that hybrid architectures based on convolution to aid ViT [16, 80] might also work well for graphs. While Token GT (ORF) shows relatively consistent attention distance over heads, Token GT (Lap) shows higher variance, implying that it learns more diverse attention patterns. Judging from the higher performance of Token GT (Lap), this suggests that the graph structure information of the Laplacian eigenvectors facilitates learning useful and diverse attention structures, which calls for future exploration of better node identifiers based on graph PEs [38, 42].
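The attention distance analysis can be reproduced along the following lines (a simplified sketch of our own, restricted to node tokens and with illustrative names; not the authors' analysis code):

```python
import networkx as nx
import numpy as np

def mean_attention_distance(graph: nx.Graph, attn: np.ndarray, node_order: list) -> float:
    """Average hop distance between node tokens, weighted by one head's attention scores.

    attn: (n, n) attention matrix restricted to node tokens, with rows summing to 1.
    """
    n = len(node_order)
    sp = dict(nx.all_pairs_shortest_path_length(graph))
    hops = np.array([[sp[u].get(v, n) for v in node_order] for u in node_order])
    per_query = (attn * hops).sum(axis=1)   # attention-weighted hop count for each query token
    return float(per_query.mean())
```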
## 6 Conclusion

We showed that Transformers directly applied to graphs can work well in both theory and practice. On the theoretical side, we proved that with appropriate token-wise embeddings, a Transformer on node and edge tokens is at least as expressive as $k$-IGN and the $k$-WL test, making it more expressive than all message-passing GNNs. For such token-wise embeddings, we showed that a combination of simple orthonormal node identifiers and trainable type identifiers suffices, which we also verified with a synthetic experiment. In an experiment with the PCQM4Mv2 large-scale dataset, we showed that the Tokenized Graph Transformer (Token GT) performs significantly better than all GNNs and is competitive with Transformer variants with strong graph-specific architectural components [78, 29, 54].

While the results suggest a promising research direction, there are challenges to be addressed in future work. First, treating each node and edge as a token requires $O((n+m)^2)$ asymptotic cost due to the quadratic nature of self-attention. While we address this to some degree with kernelization and achieve $O(n+m)$ cost, other types of efficient Transformers (e.g., sparse) that can deliver better performance are left to be tested. Another issue is the slightly lower performance compared to the state-of-the-art. Adopting Transformer engineering techniques from the vision and language domains, such as data scaling [7, 18], deepening [70, 74], hybrid architectures [16, 80], and self-supervision [17, 7, 24], is promising. In the societal aspect, to prevent potentially risky behavior in, e.g., decision making from graph-structured inputs, interpretability research regarding self-attention on graphs is desired.

We finish with interesting research directions that stem from our work. As our approach advocates viewing a graph as $(n+m)$ tokens [37], it opens up new paradigms of graph learning, including autoregressive decoding, in-context learning, prompting, and multimodal learning. Another interesting direction is to extend our theory and use self-attention to approximate the equivariant basis for general discrete group actions, which might be a viable approach for learning equivariance from data.

## Acknowledgement

This work was supported in part by Institute of Information & communications Technology Planning & Evaluation (IITP) (No. 2022-0-00926, 2022-0-00959, 2021-0-02068, and 2019-0-00075) and the National Research Foundation of Korea (NRF) (No. 2021R1C1C1012540) grants funded by the Korea government (MSIT).

## References

[1] J.-M. Andreoli. Convolution, attention and structure embedding. arXiv, 2019.
[2] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. arXiv, 1999.
[3] S. Bhattamishra, A. Patel, and N. Goyal. On the computational power of transformers and its implications in sequence modeling. In CoNLL, 2020.
[4] S. Bhojanapalli, C. Yun, A. S. Rawat, S. J. Reddi, and S. Kumar. Low-rank bottleneck in multi-head attention models. In ICML, 2020.
[5] S. Brody, U. Alon, and E. Yahav. How attentive are graph attention networks? In ICLR, 2022.
[6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Process. Mag., 2017.
[7] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In NeurIPS, 2020.
[8] C. Cai and Y. Wang. A note on over-smoothing for graph neural networks. arXiv, 2020.
[9] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In NeurIPS, 2021.
[10] Z. Chen, L. Chen, S. Villar, and J. Bruna. Can graph neural networks count substructures? In NeurIPS, 2020.
[11] K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller. Rethinking attention with performers. In ICLR, 2021.
[12] K. M. Choromanski, M. Rowland, and A. Weller. The unreasonable effectiveness of structured random orthogonal embeddings. In NeurIPS, 2017.
[13] T. Cohen and M. Welling. Group equivariant convolutional networks. In ICML, 2016.
[14] T. S. Cohen and M. Welling. Steerable cnns. In ICLR, 2017.
[15] J. Cordonnier, A. Loukas, and M. Jaggi. On the relationship between self-attention and convolutional layers. In ICLR, 2020.
[16] Z. Dai, H. Liu, Q. V. Le, and M. Tan. Coatnet: Marrying convolution and attention for all data sizes. In NeurIPS, 2021.
[17] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[19] V. P. Dwivedi and X. Bresson. A generalization of transformer networks to graphs. arXiv, 2020.
[20] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson. Benchmarking graph neural networks. arXiv, 2020.
[21] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017.
[22] A. N. Gorban, I. Y. Tyukin, D. V. Prokhorov, and K. I. Sofeikov. Approximation with random bases: Pro et contra. Inf. Sci., 2016.
[23] B. Hanin and M. Sellke. Approximating continuous functions by relu nets of minimal width. arXiv, 2017.
[24] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick. Masked autoencoders are scalable vision learners. arXiv, 2021.
[25] D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv, 2018.
[26] K. Hornik, M. B. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 1989.
[27] W. Hu, M. Fey, H. Ren, M. Nakata, Y. Dong, and J. Leskovec. OGB-LSC: A large-scale challenge for machine learning on graphs. arXiv, 2021.
[28] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[29] M. S. Hussain, M. J. Zaki, and D. Subramanian. Edge-augmented graph transformers: Global self-attention is enough for graphs. arXiv, 2021.
[30] A. Jaegle, S. Borgeaud, J. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, O. J. Hénaff, M. M. Botvinick, A. Zisserman, O. Vinyals, and J. Carreira. Perceiver IO: A general architecture for structured inputs & outputs. arXiv, 2021.
[31] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira. Perceiver: General perception with iterative attention. In ICML, 2021.
[32] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, 2020.
[33] N. Keriven and G. Peyré. Universal invariant and equivariant graph neural networks. In NeurIPS, 2019.
[34] J. Kim, S. Oh, and S. Hong. Transformers generalize deepsets and can be extended to graphs and hypergraphs. In NeurIPS, 2021.
[35] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[36] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[37] Z. Kolter, S. Bai, D. Wilmott, M. Kornbluth, and J. Mailoa. First place solution of 2019 champs predicting molecular properties challenge. https://www.kaggle.com/c/champs-scalar-coupling/discussion/106575, 2019.
[38] D. Kreuzer, D. Beaini, W. L. Hamilton, V. Létourneau, and P. Tossou. Rethinking graph transformers with spectral attention. In NeurIPS, 2021.
[39] J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
[40] Q. Li, Z. Han, and X. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.
[41] V. Likhosherstov, K. Choromanski, and A. Weller. On the expressive power of self-attention matrices. arXiv, 2021.
[42] D. Lim, J. Robinson, L. Zhao, T. E. Smidt, S. Sra, H. Maron, and S. Jegelka. Sign and basis invariant networks for spectral graph representation learning. arXiv, 2022.
[43] K. Lin, L. Wang, and Z. Liu. Mesh graphormer. In ICCV, 2021.
[44] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[45] X. Ma, X. Kong, S. Wang, C. Zhou, J. May, H. Ma, and L. Zettlemoyer. Luna: Linear unified nested attention. In NeurIPS, 2021.
[46] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. In NeurIPS, 2019.
[47] H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman. Invariant and equivariant graph networks. In ICLR, 2019.
[48] H. Maron, E. Fetaya, N. Segol, and Y. Lipman. On the universality of invariant networks. In ICML, 2019.
[49] H. Maron, O. Litany, G. Chechik, and E. Fetaya. On learning sets of symmetric elements. In ICML, 2020.
[50] E. Min, R. Chen, Y. Bian, T. Xu, K. Zhao, W. Huang, P. Zhao, J. Huang, S. Ananiadou, and Y. Rong. Transformer for graphs: An overview from architecture perspective. arXiv, 2022.
[51] D. Q. Nguyen, T. D. Nguyen, and D. Phung. Universal graph transformer self-attention networks. In WWW, 2022.
[52] K. Oono and T. Suzuki. Graph neural networks exponentially lose expressive power for node classification. In ICLR, 2020.
[53] H. Pan and R. Kondor. Permutation equivariant layers for higher order interactions. In AISTATS, 2022.
[54] W. Park, W. Chang, D. Lee, J. Kim, and S. won Hwang. Grpe: Relative positional encoding for graph transformer. arXiv, 2022.
[55] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong. Random feature attention. In ICLR, 2021.
[56] S. Ravanbakhsh, J. G. Schneider, and B. Póczos. Equivariance through parameter-sharing. In ICML, 2017.
[57] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Giménez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas. A generalist agent. arXiv, 2022.
[58] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang. Self-supervised graph transformer on large-scale molecular data. In NeurIPS, 2020.
[59] B. Rozemberczki, C. Allen, and R. Sarkar. Multi-scale attributed node embedding. J. Complex Networks, 2021.
[60] H. Serviansky, N. Segol, J. Shlomi, K. Cranmer, E. Gross, H. Maron, and Y. Lipman. Set2graph: Learning graphs from sets. In NeurIPS, 2020.
[61] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann. Pitfalls of graph neural network evaluation. arXiv, 2018.
[62] Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li. Efficient attention: Attention with linear complexities. In WACV, 2021.
[63] Y. Shi, S. Zheng, G. Ke, Y. Shen, J. You, J. He, S. Luo, C. Liu, D. He, and T. Liu. Benchmarking graphormer on large-scale molecular modeling datasets. arXiv, 2022.
[64] A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv, 2021.
[65] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler. Efficient transformers: A survey. arXiv, 2020.
[66] N. Thomas, T. E. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds. arXiv, 2018.
[67] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In CVPR, 2015.
[68] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.
[69] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In ICLR, 2018.
[70] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and F. Wei. Deepnet: Scaling transformers to 1,000 layers. arXiv, 2022.
[71] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity. arXiv, 2020.
[72] X. Wang, Z. Tu, L. Wang, and S. Shi. Self-attention with structural position representations. In EMNLP-IJCNLP, 2019.
[73] M. Weiler, M. Geiger, M. Welling, W. Boomsma, and T. Cohen. 3d steerable cnns: Learning rotationally equivariant features in volumetric data. In NeurIPS, 2018.
[74] N. Wies, Y. Levine, D. Jannai, and A. Shashua. Which transformer architecture fits my data? A vocabulary bottleneck in self-attention. In ICML, 2021.
[75] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu. On layer normalization in the transformer architecture. In ICML, 2020.
[76] Y. Xiong, Z. Zeng, R. Chakraborty, M. Tan, G. Fung, Y. Li, and V. Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In AAAI, 2021.
[77] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In ICLR, 2019.
[78] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T. Liu. Do transformers really perform bad for graph representation? In NeurIPS, 2021.
[79] F. X. Yu, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar. Orthogonal random features. In NeurIPS, 2016.
[80] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu. Incorporating convolution designs into visual transformers. In ICCV, 2021.
[81] C. Yun, S. Bhojanapalli, A. S. Rawat, S. J. Reddi, and S. Kumar. Are transformers universal approximators of sequence-to-sequence functions? In ICLR, 2020.
[82] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola. Deep sets. In NeurIPS, 2017.
[83] M. Zopf. 1-wl expressiveness is (almost) all you need. arXiv, 2022.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] See Section 3 and Section 5.
   (b) Did you describe the limitations of your work? [Yes] See Section 6.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 6.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Section 3.
   (b) Did you include complete proofs of all theoretical results?
   [Yes] We include them in the supplementary file.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We include them in the supplementary file.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 5.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Section 5.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 5.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 5.2.
   (b) Did you mention the license of the assets? [Yes] See Section 5.2.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We include them in the supplementary file.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]