# Representational Strengths and Limitations of Transformers

Clayton Sanford, Daniel Hsu
Department of Computer Science, Columbia University, New York, NY 10027
{clayton,djhsu}@cs.columbia.edu

Matus Telgarsky
Courant Institute, New York University, New York, NY 10012
matus.telgarsky@nyu.edu

Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both positive and negative results on the representation power of attention layers, with a focus on intrinsic complexity parameters such as width, depth, and embedding dimension. On the positive side, we present a sparse averaging task, where recurrent networks and feedforward networks all have complexity scaling polynomially in the input size, whereas transformers scale merely logarithmically in the input size; furthermore, we use the same construction to show the necessity and role of a large embedding dimension in a transformer. On the negative side, we present a triple detection task, where attention layers in turn have complexity scaling linearly in the input size; as this scenario seems rare in practice, we also present natural variants that can be efficiently solved by attention layers. The proof techniques emphasize the value of communication complexity in the analysis of transformers and related models, and the role of sparse averaging as a prototypical attention task, which even finds use in the analysis of triple detection.

1 Introduction

In recent years, transformer networks [Vaswani et al., 2017] have been established as a fundamental neural architecture powering state-of-the-art results in many applications, including language modeling [OpenAI, 2023], computer vision [Dosovitskiy et al., 2021], and protein folding [Jumper et al., 2021]. The key building block of transformer models is the self-attention unit, a primitive that represents interactions among input elements as inner products between low-dimensional embeddings of these elements. The success of transformer models is linked to their ability to scale their training and generalization performance to larger datasets and sequence lengths. Their representational capacity, however, underlies this scaling power, and is tied to the inductive biases of their learning algorithms. Empirically, transformer models trained with gradient-based learning algorithms exhibit biases towards certain algorithmic primitives [Edelman et al., 2022, Liu et al., 2022] and learn representations that may encode domain-specific information in the self-attention units [Clark et al., 2019, Hewitt and Manning, 2019, Rogers et al., 2020, Chen et al., 2022]. These examples indicate that transformer architectures not only provide computational benefits, but also have representational capabilities that are particularly well-matched to practical tasks.

In this paper, we investigate these inductive biases by identifying natural computational tasks for which transformers are well-suited, especially compared to other neural network architectures, as well as tasks that highlight the limitations of transformers. The tasks of sparse averaging, pair-matching, and triples-matching represent primitive operations that aggregate structural information encoded in embeddings.
We use these tasks to elucidate the relationship between the embedding dimension $m$ of a self-attention unit and its expressivity, and to showcase the fundamental representational limitations of self-attention layers.

In our model, the primary computational bottleneck faced by a transformer in computing a sequence-to-sequence$^1$ function $f : \mathcal{X}^N \to \mathcal{Y}^N$ is the constrained processing of pairs of input elements $\{x_i, x_j\} \in \mathcal{X}^2$; we allow transformers unbounded computational power when processing the individual elements $x_i \in \mathcal{X}$. This is motivated by modern scaling regimes where the context length $N$ has rapidly increased, the self-attention embedding dimension $m$ remains much smaller than $N$, and the parameterization of multi-layer perceptrons (MLPs) that operate on individual elements is much larger than $m$. Indeed, the largest GPT-3 model [Brown et al., 2020] features a context length $N = 2048$, an embedding dimension $m = 128$, and MLPs with a 12288-dimensional parameterization; the context length of GPT-4 is as large as $N = 32000$. As such, we are interested in the capabilities of transformers with $N^{o(1)}$ total size, as opposed to $N^{\Omega(1)}$. The nature of the bottleneck in our model makes the tools of communication complexity indispensable for formalizing computational limits.

1.1 Our contributions

Sparse averaging separations among atomic self-attention units.
The $q$-sparse averaging task $q\mathrm{SA}$ aims to capture the essential approximation-theoretic properties of self-attention units. In $q\mathrm{SA}$, the $i$th input $x_i$ is a pair $(y_i, z_i)$, where $z_i \in \mathbb{R}^d$ is the data part of $x_i$, simply a vector in $\mathbb{R}^d$, whereas $y_i \in \binom{[N]}{q}$ is the indexing part, which specifies $q$ locations in the input sequence; the $i$th output element of $q\mathrm{SA}$ is obtained by averaging the $q$ data parts $z_j$ given by $j \in y_i$, meaning

$$q\mathrm{SA}\big((y_1, z_1), \ldots, (y_N, z_N)\big) = \left( \frac{1}{q} \sum_{j \in y_1} z_j, \; \ldots, \; \frac{1}{q} \sum_{j \in y_N} z_j \right).$$

(See also Definition 4.) As summarized in the following informal theorem, our analysis of $q\mathrm{SA}$ in Section 3 and Appendix A illustrates the ability of the self-attention primitive to associate arbitrary subsets of input elements (as opposed to just local subsets, as specified by some sequential/topological structure), measures the expressive power accrued by increasing the embedding dimension $m$ of a self-attention unit, and indicates the representational limitations of traditional neural architectures on basic computational tasks.

Informal Theorem 1. The task $q\mathrm{SA}$ for $q \in \mathbb{Z}_+$ satisfies the following properties (see Definition 4 for a formal definition and approximation metric).
1. There exists a unit of self-attention $f$ with an $m$-dimensional embedding that approximates $q\mathrm{SA}$ if and only if $m \gtrsim q$ (Theorems 2 and 4).
2. Any fully-connected neural network whose output approximates $q\mathrm{SA}$ requires its first hidden layer to have width at least $\Omega(Nd)$ (Theorem 10).
3. Any recurrent neural network whose iterates approximate $q\mathrm{SA}$ requires a hidden state of at least $\Omega(N)$ bits (Theorem 11).

We consider the $q\mathrm{SA}$ implementation in Item 1 efficient, since the dimension of the model parameters grows with $\mathrm{poly}(q, d, \log N)$, whereas the latter two are inefficient, since their parameter (or state) dimension grows as $\mathrm{poly}(N)$. The proofs of the positive results employ embeddings for each index $j$ and each subset $y_i$ that have large inner products if and only if $j \in y_i$. The negative results involve communication complexity reductions and geometric arguments.
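To fix ideas, the following minimal Python sketch computes $q\mathrm{SA}$ directly from the definition above; it is a brute-force reference implementation (no attention involved), and the random instance generator is purely an illustrative choice of ours.

```python
import numpy as np

def qSA(y, z):
    """Brute-force q-sparse average: output i is the mean of z_j over j in y_i.

    y: length-N list of q-element index subsets (0-indexed here for convenience).
    z: (N, d) array of data vectors.
    """
    return np.stack([z[list(yi)].mean(axis=0) for yi in y])

# A small random instance.
rng = np.random.default_rng(0)
N, q, d = 8, 3, 2
z = rng.normal(size=(N, d))
y = [rng.choice(N, size=q, replace=False) for _ in range(N)]
print(qSA(y, z))   # (N, d) array; row i averages the q vectors indexed by y_i
```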
The lower-bound arguments naturally introduce a dependence on the number of bits of precision, which we suppress in the informal statements above; we note that these bounded-precision results are arguably more relevant to modern networks, which use as few as 4 or even 2 bits of numerical precision.

Contrast between pairwise and triple-wise matching with self-attention layers.
We frame standard transformer architectures as being able to efficiently represent functions that are decomposable into sparse pairwise interactions between inputs. To do so, we introduce two sequential tasks and prove a collection of constructions and hardness results that characterize the abilities of transformers to solve these tasks. Given an input sequence $X = (x_1, \ldots, x_N) \in [M]^N$ (for some $M = \mathrm{poly}(N)$), we formalize the problems of similar pair detection (Match2) and similar triple detection (Match3) as

$$\mathrm{Match2}(X)_i = 1\{\exists j \in [N] \text{ s.t. } x_i + x_j = 0 \ (\mathrm{mod}\ M)\}, \qquad (1)$$
$$\mathrm{Match3}(X)_i = 1\{\exists j_1, j_2 \in [N] \text{ s.t. } x_i + x_{j_1} + x_{j_2} = 0 \ (\mathrm{mod}\ M)\}. \qquad (2)$$

For both tasks, note that the output is an $N$-dimensional vector whose $i$th element is 1 if and only if the sequence $X$ includes a pair or triple containing $x_i$. In this sense, the problems differ from 2SUM and 3SUM, which are not sequence-to-sequence tasks. We believe these two tasks are intrinsically pairwise and triple-wise, respectively; moreover, since we also believe self-attention performs a fundamentally pairwise operation, we will use Match2 and Match3 to show a sharp gap in the representation power of self-attention.

$^1$Note, however, that attention units are permutation equivariant, so the order of elements in the input sequence $X \in \mathcal{X}^N$ is irrelevant. In practice, positional encodings are used when the sequence order is relevant.

Informal Theorem 2.
1. A single unit of standard self-attention with input and output MLPs and an $O(d)$-dimensional embedding can compute Match2 (Theorem 6).
2. A single layer of standard multi-headed self-attention cannot compute Match3 unless its number of heads $H$ or embedding dimension $m$ grows polynomially in $N$ (Theorem 7).
3. A standard transformer model can efficiently compute a modified version of Match3 that makes assumptions about embedding structure or locality (Theorems 8 and 9).
4. Under a generalized notion of third-order tensor self-attention introduced in Appendix C.3, Match3 is efficiently computable with a single unit of third-order attention (Theorem 18).

While the above result demonstrates the limitations of multi-headed self-attention and illustrates the importance of learning embeddings with contextual clues, we believe that a stronger result exists. Specifically, we conjecture that even multi-layer transformers are unable to efficiently compute Match3 without hints or augmentation.

Informal Conjecture 1. Every multi-layer transformer that computes Match3 must have width, depth, embedding dimension, or bit complexity at least $N^{\Omega(1)}$.

In Appendices C.5 and C.6, we give a heuristic information-theoretic argument to support this conjecture, prove a matching upper bound, and finally prove analogous results for graph-augmented transformers with respect to the problem of cycle detection in directed and undirected graphs.

1.2 Related work

Several computational and learning-theoretic aspects of transformers, distinct from but related to the specific aims of the present paper, have been mathematically studied in previous works.

Universality and Turing-completeness.
To demonstrate the power of transformers, universal approximation results for transformers [Yun et al., 2020, Wei et al., 2022], analogous to results for feedforward networks [Hornik et al., 1989], establish the capability of sufficiently large networks to accurately approximate general classes of functions. Note, however, that the precise minimal dependence of the required size (e.g., number of attention units, depth of the network) as a function of the input size $N$ does not directly follow from such results, and it is complicated by the interleaving of other neural network elements between attention layers. (Approximate) Turing-completeness of transformers demonstrates their power in a different manner, and such results have been established, first assuming infinite-precision weights [Pérez et al., 2019] and later also with finite precision [Wei et al., 2022]. Such results are more closely aligned with our aims, because Turing machines represent a uniform model of computation on inputs of arbitrary size. Wei et al. [2022] showed that Turing machines that run for $T$ steps can be approximated by encoder-decoder transformers of depth $\log(T)$ and size polynomial in $\log(T)$ and the number of states of the Turing machine (but the decoder runs for $T$ steps).

Formal language recognition.
The ubiquity of transformers in natural language understanding has motivated the theoretical study of their ability to recognize formal languages. On the positive side, Bhattamishra et al. [2020] constructed transformers that recognize counter languages, and Yao et al. [2021] showed that transformers of bounded size and depth can recognize Dyck languages that have bounded stack depth. Liu et al. [2022] showed that the computations of finite-state automata on sequences of length $N$ can be performed by transformers of depth $\log(N)$ and size polynomial in the number of states. On the negative side, Hahn [2020] showed limitations of modeling distributions over formal languages (including Dyck) with fixed-size transformers (though this result does not imply quantitative lower bounds on the size of the transformer). Hahn [2020], as well as Hao et al. [2022], also establish the inability of hard-attention transformers to recognize various formal languages and circuit classes by leveraging depth reduction techniques from circuit complexity [Furst et al., 1984].

Learnability.
The sample complexity of learning with low-weight transformers can be obtained using techniques from statistical learning theory and, in turn, establishes learnability of certain boolean concept classes (e.g., sparse parity) [Edelman et al., 2022, Bhattamishra et al., 2022] using transformer-based hypothesis classes. Our $q\mathrm{SA}$ function is inspired by these classes, and we establish concrete size lower bounds for approximation (and hence also learnability) by transformers. We note that our constructions use bounded-size weights, and hence, in principle, the aforementioned sample complexity results can be combined with our results to analyze empirical risk minimization for learning transformers. Prior work of Likhosherstov et al. [2021] also shows how sparse attention patterns can be achieved by self-attention units (via random projection arguments); however, when specialized to $q\mathrm{SA}$, their construction is suboptimal in terms of the sparsity level $q$.

Related models.
Graph neural networks (GNNs), like transformers, process very large inputs (graphs) using neural networks that act only on small collections of the input parts (vertex neighborhoods).
Many classes of GNNs are universal approximators for classes of invariant and equivariant functions [Maron et al., 2019, Keriven and Peyré, 2019]. At the same time, they are restricted by the distinguishing power of certain graph isomorphism tests [Xu et al., 2018, Morris et al., 2019, Chen et al., 2019], and lower bounds have been established on the network size needed to approximate such tests [Aamand et al., 2022]. Loukas [2019] established a connection between GNNs and the LOCAL [Angluin, 1980] and CONGEST [Peleg, 2000] models for distributed computation, and hence directly translates lower bounds for CONGEST (notably, for cycle detection problems) into size lower bounds for GNNs. Our lower bounds for cycle detection using transformers also leverage a connection to the CONGEST model. However, transformers do not have the same limitations as GNNs, since the computational substrate of a transformer does not depend on the input graph in the way it does with GNNs. Thus, we cannot directly import lower bounds for CONGEST to obtain lower bounds for transformers. Transformers are also related to other families of invariant and equivariant networks. Our focus on Match2 and Match3 (and related problems) was inspired by the separation results of Zweig and Bruna [2022] between models for processing sets: Deep Sets [Qi et al., 2017, Zaheer et al., 2017], which are "singleton symmetric", and the more expressive Relational Pooling networks [Santoro et al., 2017], which are only "pairwise symmetric".

1.3 Conclusion and future work

Our primary contributions are to present a multi-faceted story about transformer approximation. Firstly, $q\mathrm{SA}$ separates transformer models approximation-theoretically from RNNs and MLPs; moreover, the attention embedding dimensions both necessary and sufficient for $q\mathrm{SA}$ scale directly with $q$, meaning $q\mathrm{SA}$ also serves to characterize representation power amongst different transformers. Secondly, while single units of self-attention can solve the Match2 task, even wide layers of self-attention with high-dimensional embeddings cannot solve Match3, and we believe that deeper models cannot either. This question of deeper models is stated as a formal conjecture and addressed heuristically in Appendix C.6, using both information- and communication-theoretic proof techniques, both of which we feel are significant steps towards a complete proof. While our investigation is purely approximation-theoretic, we also include in Appendix D a preliminary empirical study, showing that attention can learn $q\mathrm{SA}$ with vastly fewer samples than recurrent networks and MLPs; we feel this further emphasizes the fundamental value of $q\mathrm{SA}$, and constitutes an exciting direction for future work.

Beyond the explicit open question in Informal Conjecture 1, we anticipate that future research could connect the separation results proved in this work to formal linguistic theory and to empirical work on the interpretation of attention matrices. This work examines Match2 and Match3 because we believe that the former could represent a key primitive for language processing tasks such as co-referencing, while the latter represents a natural extension of the former that is likely not necessary for language modeling. Rather, it may be that language modeling handles triple-wise tasks, such as the identification of subject, verb, and object components, by relying on pairwise matching constructions and clues learned within an embedding, such as those encoded in the toy problems Match3Bigram and Match3Local.
That is, transformers serve as a useful foundational model for language modeling because of their ability to integrate contextual clues and pairwise communication; while they do not extend to purely triple-wise problems, most practical sequential problems have some efficient decomposition into pairwise structures that can be easily exploited by these architectures. Future work by linguists, theoretical computer scientists, and empirical NLP practitioners could assess how foundational our primitives are and study whether there are any practical triple-wise problems that transformer models fail to solve.

2 Preliminaries

Let $B_d = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$ denote the unit ball in $\mathbb{R}^d$, and let $[n] = \{1, 2, \ldots, n\}$ denote the first $n$ positive integers. The expression $1\{P\}$ equals 1 if predicate $P$ is true and 0 otherwise. The row-wise softmax operator applied to a matrix $A \in \mathbb{R}^{N \times M}$ returns

$$\mathrm{softmax}(A)_{i,j} = \frac{\exp(A_{i,j})}{\sum_{j'=1}^{M} \exp(A_{i,j'})}.$$

2.1 Attention units and transformer architectures

We first introduce the concept of self-attention, which is used as the building block of all transformer architectures included in this paper.

Definition 1. For input dimension $d$, output dimension $d'$, embedding dimension $m$, precision $p$, and matrices $Q, K \in \mathbb{R}^{d \times m}$ and $V \in \mathbb{R}^{d \times d'}$ (encoded using $p$-bit fixed-point numbers), a self-attention unit is a function $f_{Q,K,V} : \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d'}$ with $f_{Q,K,V}(X) = \mathrm{softmax}(X Q K^{\mathsf{T}} X^{\mathsf{T}}) X V$. Let $\mathcal{A}_{d,m,d',p} = \{f_{Q,K,V} : Q, K, V\}$ denote all such self-attention units.

Self-attention units can be computed in parallel to create multi-headed attention.

Definition 2. For head count $H$ and self-attention units $f_1, \ldots, f_H \in \mathcal{A}_{d,m,d',p}$, a multi-headed attention layer is a function $L_{f_1,\ldots,f_H} : \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d'}$ with $L_{f_1,\ldots,f_H}(X) = \sum_{h=1}^{H} f_h(X)$. Let $\mathcal{A}^{H}_{d,m,d',p}$ contain all such $L_{f_1,\ldots,f_H}$.

Transformer models are composed of two components: multi-headed attention layers (as above) and element-wise multi-layer perceptrons. Due to universal approximation results, we model multi-layer perceptrons as arbitrary functions mapping fixed-precision vectors to fixed-precision vectors.

Definition 3. A multi-layer perceptron (MLP) layer is represented by some $\phi : \mathbb{R}^{d} \to \mathbb{R}^{d'}$ whose real-valued inputs and outputs can be represented using $p$-bit fixed-precision numbers. We apply $\phi$ to each element (i.e., row) of an input $X \in \mathbb{R}^{N \times d}$, abusing notation to let $\phi(X) = (\phi(x_1), \ldots, \phi(x_N)) \in \mathbb{R}^{N \times d'}$. Let $\Phi_{d,d',p}$ denote all such MLPs.

We concatenate the notation of each class of functions to denote function composition. For example, for output dimension $d'$, we use $\mathcal{A}'_{d,m,d',p} := \mathcal{A}_{m,m,d',p} \circ \Phi_{d,m,p}$ and $\mathcal{A}'^{H}_{d,m,d',p} := \mathcal{A}^{H}_{m,m,d',p} \circ \Phi_{d,m,p}$ to represent single-headed and multi-headed attention units with an input MLP, respectively. (The capabilities and limitations of these models are studied in Section 3.) For depth $D$, we let $\mathcal{T}^{D,H}_{d,m,d',p} = \Phi_{m,d',p} \circ (\mathcal{A}'^{H}_{m,m,m,p})^{D-1} \circ \mathcal{A}'^{H}_{d,m,m,p}$ represent a full transformer model comprising $D$ layers of $H$-headed self-attention with interspersed MLPs.

While two key features of transformer architectures, the residual connection and the positional embedding, are conspicuously missing from this formalism, both can be implemented easily within the framework. We can include a positional embedding by encoding the index as a coordinate of the input, i.e. $x_{i,1} = i$. Then, the subsequent MLP transformation $\phi(X)$ can incorporate $i$ suitably into the embedding.
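To make the formalism of this subsection concrete, here is a minimal numpy sketch of a self-attention unit $f_{Q,K,V}(X) = \mathrm{softmax}(XQK^{\mathsf{T}}X^{\mathsf{T}})XV$ (Definition 1) and of a multi-headed layer that sums $H$ such heads (Definition 2); the fixed-precision arithmetic and the surrounding MLPs of the formal definitions are omitted, and all dimensions are arbitrary illustrative choices.

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def self_attention(X, Q, K, V):
    # X: (N, d); Q, K: (d, m); V: (d, d_out).  Definition 1.
    return softmax_rows(X @ Q @ K.T @ X.T) @ X @ V

def multi_head(X, heads):
    # heads: list of (Q, K, V) triples; Definition 2 sums the head outputs.
    return sum(self_attention(X, Q, K, V) for (Q, K, V) in heads)

rng = np.random.default_rng(0)
N, d, m, d_out, H = 6, 4, 3, 2, 2
X = rng.normal(size=(N, d))
heads = [(rng.normal(size=(d, m)), rng.normal(size=(d, m)), rng.normal(size=(d, d_out)))
         for _ in range(H)]
print(multi_head(X, heads).shape)   # (N, d_out)
```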
A residual connection can be included additively as input to a multi-layer perceptron layer (as is standard) by implementing an approximate identity attention head $f$ with $Q$, $K$, and $V = I_m$ set to ensure that $f(X) \approx X$.$^2$ We periodically consider transformers implemented with real-valued arithmetic with infinite bit complexity; in those cases, we omit the bit complexity $p$ from the notation. Finally, we assume for the proof of Theorem 3 that the model is permitted to append a single token at the end of a sequence. That is, we say that a model $f \in \mathcal{T}^{D,H}_{d,m,d',p}$ represents a target $h : \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d'}$ if $f(X')_{1:N} = h(X)$ when $X' = (x_1, \ldots, x_N, x')$ for a constant-valued $x' \in \mathbb{R}^d$.

$^2$A simple construction involves letting $XQ = XK$ with iid Gaussian columns fixed for every index $i$. Then, the diagonals of $XQK^{\mathsf{T}}X^{\mathsf{T}}$ are far larger than all other entries and its softmax is approximately $I_N$.

3 Sparse averaging with attention units

We present the sparse averaging task to highlight the ability of transformer architectures to simulate a wide range of meaningful interactions between input elements. This task demonstrates how the embedding dimension of a self-attention unit modulates the expressive capabilities of the architecture, while showcasing the inabilities of fully-connected and recurrent neural networks to capture similar interactions (see Appendix A).

Definition 4. For sparsity $q$, problem dimension $d'$, and input dimension $d = d' + q + 1$, consider an input $X = (x_1, \ldots, x_N) \in \mathbb{R}^{N \times d}$ with $x_i = (z_i; y_i; i)$ for $z_i \in B_{d'}$ and $y_i \in \binom{[N]}{q}$.$^3$ Let the $q$-sparse average be

$$q\mathrm{SA}(X)_i = \frac{1}{q} \sum_{j \in y_i} z_j, \quad i \in [N].$$

For accuracy $\epsilon > 0$, a function $f : \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d'}$ $\epsilon$-approximates $q\mathrm{SA}$ if for all $X$, $\max_{i \in [N]} \|f(X)_i - q\mathrm{SA}(X)_i\|_2 \le \epsilon$.

$^3$We may encode a $q$-element subset of $[N]$ as a vector in $[N]^q$ constrained to have distinct components.

[Figure 1: A visualization of the $q\mathrm{SA}$ function outputs given a sequence of inputs $(z_i; y_i; i)_{i \in [N]}$: (a) a bipartite graph relating the subsets $y_i$ and vectors $z_i$ in $q\mathrm{SA}(X)$; (b) the attention and value matrices used for the self-attention construction of $q\mathrm{SA}(X)$ in Theorem 2; (c) the key and query embeddings that produce the self-attention matrix in (b).]

Figure 1a visualizes the sparse averaging task as a bipartite graph between subsets $y_i$ and elements $z_i$ with corresponding averages. Theorems 2 and 4 jointly show that the minimum embedding dimension $m$ of single self-attention units $\mathcal{A}'_{d,m,d',p}$ that $O(\frac{1}{q})$-approximate $q\mathrm{SA}$ scales linearly with $q$. We believe that the sparse averaging problem is thus a canonical problem establishing the representational capabilities and inductive biases of self-attention units.

3.1 Self-attention can approximate $q\mathrm{SA}$ when $m \gtrsim q$

Our principal positive result shows that the sparse averaging task $q\mathrm{SA}$ can be approximately solved using self-attention units with fixed-precision arithmetic and embedding dimension $m$ growing with $q \log N$.

Theorem 2 (Fixed-precision). For any $N$, any $m = \Omega(d' + q \log N)$, any $\epsilon \in (0, 1)$, and $p = \Omega(\log(\frac{q}{\epsilon} \log N))$, there exists some $f \in \mathcal{A}'_{d,m,d',p}$ that $\epsilon$-approximates $q\mathrm{SA}$.
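Before sketching the proof, the following numpy snippet illustrates the mechanism behind Theorem 2 under simplifying assumptions: keys are columns of a random sign matrix scaled by $1/\sqrt{m'}$, and each query is produced by a small least-squares solve that plays the role of the dual certificate of Lemma 13 in Appendix B.1; the particular constants and the random instance are illustrative choices of ours rather than the exact parameters of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
N, q, d_data = 64, 3, 2
m_prime = 8 * q * int(np.log(N))          # key/query dimension ~ q log N
alpha = 2 * np.log(4 * N / 0.05)          # softmax temperature, as in the proof (eps = 0.05)

# Keys: columns of a random sign matrix scaled by 1/sqrt(m'), one per index j in [N].
U = rng.choice([-1.0, 1.0], size=(m_prime, N)) / np.sqrt(m_prime)

# Data vectors and index subsets defining the qSA instance.
Z = rng.normal(size=(N, d_data))
Y = [rng.choice(N, size=q, replace=False) for _ in range(N)]

def query_for(y):
    # Least-norm w with <u_j, w> = 1 for all j in y (the role of Lemma 13's certificate).
    Uy = U[:, y]
    return Uy @ np.linalg.solve(Uy.T @ Uy, np.ones(len(y)))

W = np.stack([query_for(y) for y in Y])    # one query per output position i
S = alpha * (W @ U)                        # attention scores: alpha * <w_{y_i}, u_j>
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)          # row-wise softmax
approx = A @ Z                             # self-attention output with values z_j
exact = np.stack([Z[list(y)].mean(axis=0) for y in Y])
print(np.abs(approx - exact).max())        # worst-case approximation error (should be small)
```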
While the full proof appears in Appendix B.1, we briefly sketch the argument here. Because the output of a self-attention unit is a convex combination of rows of the value matrix $\phi(X) V \in \mathbb{R}^{N \times d'}$, a natural way to approximate $q\mathrm{SA}$ with a unit of self-attention is to let each value be the corresponding vector in the average (i.e., $V^{\mathsf{T}} \phi(x_i) = z_i$) and to choose the key and query functions so that the attention matrix satisfies

$$\mathrm{softmax}(\phi(X) Q K^{\mathsf{T}} \phi(X)^{\mathsf{T}})_{i,j} \approx \begin{cases} \frac{1}{q} & \text{if } j \in y_i, \\ 0 & \text{otherwise.} \end{cases}$$

To do so, we let each key $K^{\mathsf{T}} \phi(x_i)$ represent a fixed vertex on a convex polytope, which depends only on the index $i$ and is constructed from random binary vectors. We select each query $Q^{\mathsf{T}} \phi(x_i)$ to ensure that $\phi(x_i)^{\mathsf{T}} Q K^{\mathsf{T}} \phi(x_j)$ is a fixed large value if $j \in y_i$ and a slightly smaller value otherwise. We obtain the precise query, key, and value embeddings by employing tools from dual certificate analysis in the theory of compressed sensing. We visualize this construction in Figures 1b and 1c for $q = 3$ and $d' = 4$, which present the attention and value matrices necessary for the construction, and plot a polytope of keys (red dots) with each face corresponding to a subset $y_i$ (green dots). The construction is empirically relevant; Figure 2 shows that a unit of self-attention trained on data generated by the $q\mathrm{SA}$ task recovers an attention matrix similar to the one stipulated in our construction and visualized in Figure 1b.

[Figure 2: Attention matrix $\mathrm{softmax}(\phi(X) Q K^{\mathsf{T}} \phi(X)^{\mathsf{T}}) \in \mathbb{R}^{20 \times 20}$ for a fixed example after $T$ epochs of training a self-attention unit to solve $q\mathrm{SA}$ for $q = 3$, shown at $T = 1000$ and $T = 40000$. Each row $i$ corresponds to subset $y_i$, and each cell $j \in y_i$ is outlined in red. See Appendix D for experimental details.]

The logarithmic dependence of the embedding dimension $m$ on the sequence length $N$ can be eliminated by considering self-attention units with real-valued arithmetic of infinite bit complexity.

Theorem 3 (Infinite-precision). For fixed $N$, $m = \Omega(d' + q)$, and $\epsilon > 0$, there exists some $f \in \mathcal{A}'_{d,m,d'}$ that $\epsilon$-approximates $q\mathrm{SA}$.

[Figure 3: The $mp$-bit communication protocol used to reduce the hardness of computing $q\mathrm{SA}$ with a single unit of self-attention to the hardness of solving the DISJ communication problem, for the proof of Theorem 4 with $q = 4$.]

The proof of Theorem 3 employs a similar polytope-based construction in Appendix B.2, relying on a cyclic polytope rather than one drawn from discrete boolean vectors. Theorem 16 proves the near-optimality of that bound by employing a geometric argument to show that a variant of $q\mathrm{SA}$ can only be approximated by a restricted family of self-attention units with a sufficiently high-dimensional embedding.

3.2 Self-attention cannot approximate $q\mathrm{SA}$ when $m \lesssim q$

We show that the construction used to prove Theorem 2 is nearly optimal.

Theorem 4. For any sufficiently large $q$, any $N \ge 2q + 1$, and any $d' \ge 1$, there exists a universal constant $c$ such that if $mp \le cq$, then no $f \in \mathcal{T}^{1,1}_{d,m,d',p}$ exists that $\frac{1}{2q}$-approximates $q\mathrm{SA}$.

(By choosing $p = O(\log(q \log N))$, Theorem 2 is shown to be optimal up to logarithmic factors in $q$ and doubly-logarithmic factors in $N$.) The proof of Theorem 4 employs a standard communication complexity argument based on a reduction from the following set disjointness problem in the two-party communication model, in which each party possesses a subset of an $n$-element domain (encoded as an $n$-bit string), and the parties wish to jointly determine whether their subsets are disjoint. We note that communication complexity is a commonly-used technique for proving lower bounds on the representational power of circuits and feedforward neural networks [see, e.g., Karchmer and Wigderson, 1988, Ben-David et al., 2002, Martens et al., 2013, Vardi et al., 2021].

Fact 5 (Set disjointness communication lower bound [Yao, 1979]).
Suppose Alice and Bob are given inputs $a, b \in \{0,1\}^n$, respectively, with the goal of jointly computing $\mathrm{DISJ}(a, b) = \max_i a_i b_i$ by alternately sending a single-bit message to the other party over a sequence of communication rounds. Any deterministic protocol for computing $\mathrm{DISJ}(a, b)$ requires at least $n$ rounds of communication.

Our proof designs a communication protocol that Alice and Bob use to jointly compute $\mathrm{DISJ}(a, b)$ when $n = q$ in $O(mp)$ rounds of communication, under the assumption that there exists an $f$ that closely approximates $q\mathrm{SA}$. Alice encodes her input $a$ in a single subset by letting $y_{2q+1} = \{2i + a_i - 1 : i \in [q]\}$. Bob uses his input $b$ to assign $z_{2i-1} = 2b_i - 1$ and $z_{2i} = -1$ for all $i \in [q]$. All other input components are set to constant values known by both parties. Alice sends her $mp$-bit query embedding $Q^{\mathsf{T}}\phi(x_{2q+1})$ bit-by-bit to Bob, who approximately computes $q\mathrm{SA}$ by determining the outcome of $f$. The crux of the reduction shows that $q\mathrm{SA}(X)_{2q+1} = -1$ if and only if $a_i b_i = 0$ for all $i \in [q]$, which allows Bob to determine $\mathrm{DISJ}(a, b)$. We visualize the protocol in Figure 3 and give the proof in Appendix B.3. The proofs of Theorems 7, 11, 21, and 23 employ similar communication complexity reductions to DISJ.

4 Standard transformer models can only efficiently represent intrinsically pairwise functions

In this section, we argue that the standard transformer architecture is unable to efficiently represent functions that do not decompose into a small number of pairwise-symmetric functions. We do this by contrasting the (in)approximability of intrinsically pairwise and triple-wise functions, respectively Match2 and Match3 (defined in (1) and (2)), and their variants.

4.1 Efficient computation of Match2 with standard self-attention

We first show that Match2 can be efficiently approximated by a single standard (pairwise) self-attention unit.

Theorem 6. For any input size $N$, input range $M = N^{O(1)}$, and fixed-precision bit complexity $p = O(\log M)$, there exists a transformer architecture $f \in \mathcal{T}^{1,1}_{1,m,1,p}$ with a single self-attention unit with embedding dimension $m = 3$ such that for all $X \in [M]^N$, $f(X) = \mathrm{Match2}(X)$.

The proof, given in Appendix C.1, uses both a blank token and a trigonometric positional embedding, which ensures that

$$\phi(x_i)^{\mathsf{T}} Q K^{\mathsf{T}} \phi(x_j) = c \sum_{k} \cos\!\Big(\frac{2\pi (x_{i,k} + x_{j,k})}{M}\Big)$$

for some sufficiently large constant $c$. This embedding ensures that a cell of the attention matrix $\mathrm{softmax}(\phi(X) Q K^{\mathsf{T}} \phi(X)^{\mathsf{T}})_{i,j}$ is extremely close to zero unless $x_i + x_j = 0 \ (\mathrm{mod}\ M)$. (A small numerical illustration of this construction appears below, after Section 4.2.)

4.2 Hardness of computing Match3 with a multi-headed self-attention layer

Although Match2 can be efficiently represented using a single unit of standard self-attention, representing Match3 using an entire layer of multi-headed attention units is impossible unless either the number of heads $H$, the embedding dimension $m$, or the precision $p$ grows as $N^{\Omega(1)}$.

Theorem 7. There is a universal constant $c > 0$ such that for sufficiently large $N$ and any $M \ge N + 1$, if $mpH \le cN / \log\log N$, then there is no $f \in \mathcal{T}^{1,H}_{1,m,1,p}$ satisfying $f(X) = \mathrm{Match3}(X)$ for all $X \in [M]^N$.

We give the proof in Appendix C.2. Like that of Theorem 4, the proof relies on a reduction from set disjointness in two-party communication. The proof of the lower bound applies to a domain-restricted variant of Match3, which actually makes the problem substantially simpler to solve. In Remark 1, we show how this variant of Match3 introduces a depth separation between the representational powers of single-layer and two-layer transformer models.
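Returning to Theorem 6, the following numpy sketch builds the trigonometric query/key embeddings and the blank token directly (bypassing the input MLP of the formal construction) and compares the thresholded attention output against a brute-force Match2. The temperature $c = M^2 \log(6N)$ follows the proof in Appendix C.1, while the example instance and the exact threshold are illustrative choices of ours.

```python
import numpy as np

def match2_attention(x, M, c=None):
    # x: integer array of length N with entries in [1, M]; returns Match2(x) in {0,1}^N.
    x = np.asarray(x, dtype=float)
    N = len(x)
    if c is None:
        c = M**2 * np.log(6 * N)           # temperature from the proof of Theorem 6
    ang = 2 * np.pi * x / M
    # 3-dimensional query/key embeddings; a final "blank" token is appended.
    Q = c * np.column_stack([np.cos(ang), np.sin(ang), np.ones(N)])
    K = np.column_stack([np.cos(ang), -np.sin(ang), np.zeros(N)])
    K = np.vstack([K, [0.0, 0.0, 1.0]])    # blank token key
    v = np.append(np.ones(N), 0.0)         # blank token carries value 0
    S = Q @ K.T                            # S[i, j] = c*cos(2*pi*(x_i + x_j)/M); S[i, blank] = c
    S -= S.max(axis=1, keepdims=True)      # numerically stable softmax
    A = np.exp(S)
    A /= A.sum(axis=1, keepdims=True)
    return (A @ v > 0.25).astype(int)      # "output MLP": threshold the attended value

x = np.array([3, 7, 5, 2, 9, 1])
M = 10
print(match2_attention(x, M))                                   # attention-based answer
print([int(any((xi + xj) % M == 0 for xj in x)) for xi in x])   # brute-force Match2
```

When a match exists, the blank token and the matching tokens share the maximal score, so the attended value is at least roughly 1/2; when no match exists, essentially all attention falls on the blank token and the attended value is near zero, so any threshold strictly between those regimes works.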
As mentioned in the introduction, we also conjecture that multiple layers of multi-headed attention are subject to the same impossibility (Conjecture 19). The impossibility is specific to standard (pairwise) attention; in Appendix C.4, we show that Match3 can be efficiently computed with a single unit of third-order self-attention.

4.3 More efficient constructions for simplified Match3 computations

While the previous section suggests that no efficient construction exists to compute Match3 with standard transformer models, practical examples of triple detection abound. For example, a transformer-based language model will likely succeed in linking a subject/verb/object triple because all three tokens likely inhabit the same local region, and because the model could agglomerate the triple by first identifying a pair and then adding the third. Here, we introduce two variants of the Match3 problem that have additional structure to serve as hints. The first variant specifies triple sums comprising the input element and a neighboring pair elsewhere in the sequence: for each $i \in [N]$,

$$\mathrm{Match3Bigram}(X)_i = 1\{\exists j \text{ s.t. } x_i + x_j + x_{j+1} = 0 \ (\mathrm{mod}\ M)\}.$$

The second focuses on localized sums, where all components of a triple must lie within a fixed range of constant width $K \ll N$: for each $i \in [N]$,

$$\mathrm{Match3Local}(X)_i = 1\{\exists j_1, j_2 \text{ s.t. } x_i + x_{j_1} + x_{j_2} = 0 \ (\mathrm{mod}\ M), \ |i - j_1|, |i - j_2| \le K\}.$$

We show that both can be efficiently represented using compact standard transformer models.

Theorem 8. For any $N$, $M = N^{O(1)}$, and $p = O(\log M)$, there exists a transformer architecture $f \in \mathcal{T}^{D,1}_{1,m,1,p}$ with embedding dimension $m = 3$ and depth $D = 2$ such that for all $X \in [M]^N$, $f(X) = \mathrm{Match3Bigram}(X)$.

Informally, the first layer of the construction uses a sinusoidal positional encoding to compute each bigram sum $x_j + x_{j+1}$ in the $j$th element of the sequence. The second layer applies the Match2 construction provided by Theorem 6 to determine, for each $i$, whether there exists a $j$ such that $x_i + x_j + x_{j+1} = 0 \ (\mathrm{mod}\ M)$.

Theorem 9. For any $N$, $M = N^{O(1)}$, and $K \le N$, there exists a transformer architecture $f \in \mathcal{T}^{1,1}_{1,m,1,p}$ with embedding dimension $m = O(K \log N)$ and bit complexity $p = O(\log(K \log N))$ such that for all $X \in [M]^N$, $f(X) = \mathrm{Match3Local}(X)$.

Proof. We implement the localized construction by using Theorem 2 to construct a specific sparse simultaneous average of the inputs with $q := 2K + 1$ and $d' := 2K + 1$. To do so, we use the input MLP to convert $x_i$ to the embedding $(z_i; y_i; i)$, for zero-padded input $z_i = x_i e_{i'} \in \mathbb{R}^{2K+1}$ with $i' = i \ (\mathrm{mod}\ 2K + 1)$ and subset $y_i = \{i - K, i - K + 1, \ldots, i + K\} \in \binom{[N]}{2K+1}$. This construction ensures that the $i$th element of the self-attention output computes (a rotation, and rescaling by $\frac{1}{2K+1}$, of) $(x_{i-K}, x_{i-K+1}, \ldots, x_{i+K})$. An output MLP can then verify whether any matching triples involving $x_i$ exist among those vectors.

Acknowledgments and Disclosure of Funding

We are grateful for many discussions with and feedback from Navid Ardeshir, Peter Bartlett, Alberto Bietti, Yuval Efron, Christos Papadimitriou, Shivam Nadimpalli, Rocco Servedio, Yusu Wang, and Cyril Zhang. This work was supported in part by NSF grants CCF-1740833 and IIS-1563785, a JP Morgan Faculty Award, and an NSF Graduate Research Fellowship.

References

Anders Aamand, Justin Chen, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Nicholas Schiefer, Sandeep Silwal, and Tal Wagner. Exponentially improving the complexity of simulating the Weisfeiler-Lehman test with graph neural networks. In Advances in Neural Information Processing Systems 35, 2022.

Dana Angluin.
Local and global properties in networks of processors. In Proceedings of the Twelfth Annual ACM Symposium on Theory of Computing, 1980. Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. Limitations of learning via embeddings in euclidean half spaces. Journal of Machine Learning Research, 3(Nov):441 461, 2002. Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. On the ability and limitations of transformers to recognize formal languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020. Satwik Bhattamishra, Arkil Patel, Varun Kanade, and Phil Blunsom. Simplicity bias in transformers and their ability to learn sparse boolean functions. ar Xiv preprint ar Xiv:2211.12316, 2022. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ar Xiv preprint ar Xiv:2005.14165, 2020. Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203 4215, 2005. Nuo Chen, Qiushi Sun, Renyu Zhu, Xiang Li, Xuesong Lu, and Ming Gao. Cat-probing: A metric-based approach to interpret how pre-trained models for programming language attend code structure. ar Xiv preprint ar Xiv:2210.04633, 2022. Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna. On the equivalence between graph isomorphism testing and function approximation with GNNs. In Advances in Neural Information Processing Systems 32, 2019. Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does bert look at? an analysis of bert s attention. ar Xiv preprint ar Xiv:1906.04341, 2019. Amit Daniely. Depth separation for neural networks. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 690 696. PMLR, 07 10 Jul 2017. URL https://proceedings.mlr. press/v65/daniely17a.html. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2021. Benjamin L. Edelman, Surbhi Goel, Sham M. Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, 2022. Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 907 940, Columbia University, New York, New York, USA, 23 26 Jun 2016. PMLR. URL https://proceedings. mlr.press/v49/eldan16.html. Merrick Furst, James B Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy. Mathematical systems theory, 17(1):13 27, 1984. David Gale. Neighborly and cyclic polytopes. In Proc. Sympos. Pure Math, volume 7, pages 225 232, 1963. Michael Hahn. 
Theoretical limitations of self-attention in neural sequence models. Trans. Assoc. Comput. Linguistics, 8:156 171, 2020. doi: 10.1162/tacl\_{a}{\_{0}{0}{3}}{0}6. URL https: //doi.org/10.1162/tacl_a_00306. Yiding Hao, Dana Angluin, and Robert Frank. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Trans. Assoc. Comput. Linguistics, 10:800 810, 2022. URL https://transacl.org/ojs/index.php/tacl/article/view/3765. John Hewitt and Christopher D Manning. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Netw., 2(5):359 366, July 1989. John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583 589, 2021. Mauricio Karchmer and Avi Wigderson. Monotone circuits for connectivity require super-logarithmic depth. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, 1988. Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. In Advances in Neural Information Processing Systems 32, 2019. Valerii Likhosherstov, Krzysztof Choromanski, and Adrian Weller. On the expressive power of self-attention matrices. ar Xiv preprint ar Xiv:2106.03764, 2021. Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. Co RR, abs/2210.10749, 2022. doi: 10.48550/ar Xiv.2210.10749. Andreas Loukas. What graph neural networks cannot learn: depth vs width. ar Xiv preprint ar Xiv:1907.03199, 2019. Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. On the universality of invariant networks. In International Conference on Machine Learning, 2019. James Martens, Arkadev Chattopadhya, Toni Pitassi, and Richard Zemel. On the representational efficiency of restricted boltzmann machines. In Advances in Neural Information Processing Systems 26, 2013. Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17(4):1248 1282, 2007. Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence, 2019. Open AI. Gpt-4 technical report, 2023. David Peleg. Distributed computing: a locality-sensitive approach. SIAM, 2000. Jorge Pérez, Javier Marinkovi c, and Pablo Barceló. On the turing completeness of modern neural network architectures. ar Xiv preprint ar Xiv:1901.03429, 2019. Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in bertology: What we know about how bert works. Transactions of the Association for Computational Linguistics, 8:842 866, Dec 2020. ISSN 2307-387X. doi: 10.1162/tacl_a_00349. URL http://dx.doi.org/10.1162/ tacl_a_00349. 
Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems 30, 2017. Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1): 145 147, 1972. Saharon Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1):247 261, 1972. Matus Telgarsky. Benefits of depth in neural networks. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1517 1539, Columbia University, New York, New York, USA, 23 26 Jun 2016. PMLR. URL https://proceedings.mlr.press/v49/telgarsky16.html. Vladimir Naumovich Vapnik and Aleksei Yakovlevich Chervonenkis. The uniform convergence of frequencies of the appearance of events to their probabilities. Doklady Akademii Nauk, 181(4): 781 783, 1968. Gal Vardi, Daniel Reichman, Toniann Pitassi, and Ohad Shamir. Size and depth separation in approximating benign functions with neural networks. In Conference on Learning Theory, 2021. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017. Colin Wei, Yining Chen, and Tengyu Ma. Statistically meaningful approximation: a case study on approximating turing machines with transformers. Advances in Neural Information Processing Systems, 35:12071 12083, 2022. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? ar Xiv preprint ar Xiv:1810.00826, 2018. Andrew Chi-Chih Yao. Some complexity questions related to distributive computing (preliminary report). In Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing, 1979. Shunyu Yao, Binghui Peng, Christos H. Papadimitriou, and Karthik Narasimhan. Self-attention networks can process bounded hierarchical languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021. Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020. Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems 30, 2017. Günter M Ziegler. Lectures on polytopes. Graduate Texts in Mathematics, 152, 2006. Aaron Zweig and Joan Bruna. Exponential separations in symmetric neural networks. Co RR, abs/2206.01266, 2022. doi: 10.48550/ar Xiv.2206.01266. A Fully-connected neural networks and recurrent neural networks cannot efficiently approximate q SA A.1 Only wide fully-connected neural networks can approximate q SA In this section, we show that any fully-connected neural network that approximates q SA : RNd RNd must have width m = Ω(N).4 We consider networks of the form f(x) = g(Wx) for some weight matrix W Rm Nd (the first layer weights) and arbitrary function g : Rm RNd (computed by subsequent layers of a neural network). Theorem 10. Suppose q N 2 . 
Any fully-connected neural network f defined as above that 1 2q- approximates q SA satisfies m rank(W) Nd Proof. For simplicity, we arrange the input as x = (1; . . . ; N; y1; . . . ; y N; z1; . . . ; z N) and W = [ W; V1; . . . ; VN] with z1, . . . , z N Bd , W Rm N(d d ), and V1, . . . , VN Rm d . If rank(W) Nd 2 1, then so too is rank([Vq; . . . ; VN]) Nd 2 1, and [Vq; . . . ; VN] has a nontrivial null space containing a nonzero vector u = (uq; . . . ; u N) R(N q)d . Let ξ = 1 maxj {q,...,N} uj 2 (uq; . . . ; u N), z = ( 0; . . . ; 0; ξq; . . . ; ξN), and z = ( 0; . . . ; 0; ξq; . . . ; ξN). Then, 1. zj, z j Bd for all j [N]; 2. Vjzj = Vjz j = 0 for all j [N]; and 3. zj z j 2 = 2 for some j {q, . . . , N}. Therefore, for any y1, . . . , y N [N] q , respective x = (1; . . . ; N; y1; . . . ; y N; z1; . . . ; z N) and x = (1; . . . ; N; y1; . . . ; y N; z 1; . . . ; z N) satisfy f(x) = f(x ). Consider y with yj = (1, . . . , q 1, j) for each j {q, . . . , N}. Then, q SA(x)j = 1 q ξj and q SA(x )j = 1 Hence, q SA(x)j q SA(x )j 2 2 q. Because f(x) = f(x ), max f(x) q SA(x)j 2 , f(x ) q SA(x )j 2 1 so f can approximate q SA to accuracy no better than 1 A.2 Only high-memory recurrent neural networks can approximate q SA In this section, we show that any memory-bounded algorithm that approximates q SA : RN d RN d must use a large hidden state (memory) as it processes the input elements. This lower bound applies to various recurrent neural network (RNN) architectures. A memory-bounded algorithm with an m-bit memory processes input X RN d sequentially as follows. There is an initial memory state h0 {0, 1}m. For i = 1, 2, . . . , N, the algorithm computes the i-th output f(X)i Rd and the updated memory state hi as a function of the input xi Rd and previous memory state hi 1: (f(X)i, hi) = gi(xi, hi 1), 4We regard inputs as Nd-dimensional vectors rather than N d matrices. where gi : Rd {0, 1}m Rd {0, 1}m is permitted to be an arbitrary function, and f : RN d RN d is the function computed by the algorithm. Our lower bound applies to algorithms that only need to solve the subclass of causal instances of q SA in which the input X = ((zi, yi, i))i [N] RN d is promised to satisfy yi = for all i N/2 + 1, and yi {1, . . . , N/2 + 1} for all i > N/2 + 1. Theorem 11. For any ε (0, 1), any memory-bounded algorithm that ε-approximates q SA (for q = 1 and d = 1) on the subclass of causal instances must have memory m (N 1)/2. Proof. Consider an m-bit memory-bounded algorithm computing a function f : RN d RN that ε-approximates q SA (for q = 1 and d = 1). We construct, from this algorithm, a communication protocol for DISJ (with N = 2n + 1) that uses m bits of communication. Let a, b {0, 1}n be the input for DISJ provided to Alice and Bob, respectively. The protocol is as follows. 1. Alice constructs inputs xi = (zi, , i) for i = 1, . . . , n + 1, where for each i = 1, . . . , n, zi = +1 if ai = 0, 1 if ai = 1, and zn+1 = +1. Bob constructs inputs xn+1+i = (0, yn+1+i, n + 1 + i) for i = 1, . . . , n, where yn+1+i = {n + 1} if bi = 0, {i} if bi = 1. Observe that, for this input X = (x1, . . . , x2n+1), we have q SA(X)n+1+i = +1 if aibi = 0, 1 if aibi = 1. 2. Alice simulates the memory-bounded algorithm on the first n + 1 inputs x1, . . . , xn+1, and sends Bob the m-bit memory state hn+1. This requires m bits of communication. 3. Starting with hn+1, Bob continues the simulation of the memory-bounded algorithm on these n additional inputs xn+2, . . . , x2n+1. 4. If any output f(X)n+1+i for i = 1, . . . 
, n satisfies f(X)n+1+i < 0, then Bob outputs 1 (not disjoint); otherwise Bob outputs 0 (disjoint). The approximation guarantee of f implies that sign(f(X)n+1+i) = q SA(X)n+1+i for all i = 1, . . . , n, so Bob outputs 1 if and only if a and b are not disjoint. Because this protocol for DISJ uses m bits of communication, by Fact 5, it must be that m n = (N 1)/2. We note that the proof of Theorem 11 can be simplified by reducing from the INDEX problem, which has a 1-way communication lower bound of n bits. This suffices for "single pass" algorothms, such as standard RNNs. However, the advantage of the above argument (and reducing from DISJ) is that it easily extends to algorithms that make multiple passes over the input. Such algorithms are able to capture bidirectional recurrent neural net and related models. A straightforward modification of the protocol in the proof of Theorem 11 shows that Ω(N) memory is required for any algorithm that makes O(1) passes over the input (and computes the outputs in a final pass). B Supplementary results for Section 3 B.1 Proof of Theorem 2 Theorem 2 (Fixed-precision). For any N, any m Ω(d + q log N), any ϵ (0, 1), and p = Ω(log( q ϵ log N)), there exists some f A d,m,d ,p that ϵ-approximates q SA. Proof. Before explaining how they are produced by the input MLP, we introduce the corresponding key, value, and query inputs. The values will simply be ϕ(X)V = (z1, . . . , z N). For some m = m d 2 , let ϕ(X)K = (u1, . . . , u N) RN m be embedded key vectors, where u1, . . . , u N { 1/ m }m are the columns of a m N matrix satisfying the (q, 1/4)-restricted isometry and orthogonality property (Definition 5), as guaranteed to exist by Lemma 12 and the assumption on m . Let α := 2 log(4N/ϵ) . By Lemma 13, for each y [N] q , there exists wy Rm with wy 2 2 q satisfying ui , wy = 1 for all i y, | ui , wy | 1 2 for all i / y. Given the bounded precision of the model, we are not free to represent the vectors wy exactly. Under pbit precision for p sufficiently large, we there exists a vector of p-bit floating point numbers f wy Rm for every wy with wy 2 2 q satisfying f wy wy 2 ϵ 4α. As an immediate consequence, | ui , f wy ui , wy | ϵ 4α for all i and y (by Cauchy-Schwarz). The remainder of the proof demonstrates that the necessary properties of the argument hold even with this approximation. We now describe how to structure the neural network. We define an MLP ϕ : Rd Rm as ϕ(xi) = ϕ(zi; yi; i) = (zi; αg wyi; ui), which works simply by using a look-up table on the values of ui and g wyi from keys i and yi respectively. Then, we define Q, K, V as sparse boolean-valued matrices that simply copy their respective elements from ϕ(X). We analyze the output of the softmax. If i yi, then softmax(ϕ(X)QK Tϕ(X) T)i,i = exp(α ui, f wi ) P i yi exp(α ui, g wi ) + P i yi exp(α ui, g wi ) 4) q exp(α + ϵ 4) + N exp( α qeα + Neα/2 exp ϵ An analogous argument shows that softmax(ϕ(X)QK Tϕ(X) T)i,i 1 Likewise, if i yi, then softmax(ϕ(X)QK Tϕ(X) T)i,i exp( α 4) q exp(α ϵ We thus conclude that that we meet the desired degree of approximation for such m: f(X)i q SA(X)i 2 = q softmax(ϕ(X)QK Tϕ(X) T)i,i zi i yi (softmax(ϕ(X)QK Tϕ(X) T)i,i ) zi 2q + (N q) ϵ 2N ϵ. B.1.1 Restricted isometry and orthogonality property The proof relies on the restricted isometry and orthogonality property from the compressed sensing literature. For v RN, let supp(v) = {i [N] : vi = 0}. Definition 5. 
We say a matrix U Rm N satisfies the (q, δ)-restricted isometry and orthogonality property if Uv 2 2 [(1 δ) v 2 2, (1 + δ) v 2 2] and | Uv, Uv | δ v 2 v 2 for all vectors v, v RN with | supp(v)| q, | supp(v )| 2q, and supp(v) supp(v ) = . The first result shows the existence of a sign-valued matrix U that satisfies the desired distancepreserving property. Lemma 12 (Consequence of Theorem 2.3 of Mendelson et al. [2007] and Lemma 1.2 of Candes and Tao [2005]). There is an absolute constant C > 0 such that the following holds. Fix δ (0, 1/2) and q N. Let U denote a random m N matrix of independent Rademacher random variables scaled by 1/ m. If m C(q log N)/δ2, then with positive probability, U satisfies the (q, δ)-restricted isometry and orthogonality property. Sparse subsets of the columns of such a U can then be linearly separated from all other columns. Lemma 13 (Consequence of Lemma 2.2 in Candes and Tao [2005]). Fix δ (0, 1/2) and q N. Let matrix U = [u1, . . . , u N] Rm N satisfy the (q, δ)-restricted isometry and orthogonality property. For every vector v {0, 1}N with supp(v) q, there exists w Rm satisfying w 2 q/(1 2δ), ui, w = 1 if vi = 1, | ui, w | δ/(1 2δ) if vi = 0. B.2 Proof of Theorem 3 Theorem 3 (Infinite-precision). For fixed N, m Ω(d +q) and ϵ > 0, there exists some f A d,m,d that ϵ-approximates q SA. The proof relies on the properties of neighborly polytopes, which we define. Definition 6 (Ziegler [2006]). A polytope P is q-neighborly if every subset of q q vertices forms a (q 1)-face. We give a q-neighborly polytope below that we use for the construction. For vectors v1, . . . , v N Rm , let Conv(v1, . . . , v N) = {PN i=1 αivi : α [0, 1]N, P i αi = 1} denote their convex hull. Fact 14 (Theorem 1 of Gale [1963]). For t R, let θ(t) = (t, . . . , tm ) Rm . Then, for all distinct t1, . . . , t N R, the cyclic polytope Conv(θ(t1), . . . , θ(t N)) is m 2 -neighborly. The proof of Theorem 3 is immediate from the aforementioned fact and the following lemma. Lemma 15. Suppose there exists u1, . . . , u N Rm such that Conv(u1, . . . , u N) is q-neighborly. Then, for any ϵ > 0, there exists some f A d,m,d with fixed key vectors ϕ(X)K = (u1, . . . , u N) that ϵ-approximates q SA. Proof. The construction employs a similar look-up table MLP ϕ to the one used in the proof of Theorem 2. We let the key and value embeddings be ϕ(X)K = ((u1, 1), . . . , (u N, 1)) RN (m +1), and ϕ(X)V = (z1, . . . , z N) RN d. To set the query vectors, observe that for any face F of a polytope P, there exists a hyperplane HF such that F HF and P \ F lies entirely on one side of HF . Thus, for every y [N] q , there exists w y Rm and by R such that w T y ui + by = 1 if i y, < 1 otherwise. For α > 0, let ϕ(xi)TQ = αwy = α(w y, by). We construct the MLP to satisfy ϕ(xi) = (zk; wyi; ui; 1) Rm for m = 2m + 2 and set parameter weights accordingly. Following the softmax analysis of Theorem 3, a sufficiently large choice of α ensures that maxi [N] f(X)i q SA(X)i 2 ϵ. B.3 Proof of Theorem 4 Theorem 4. For any sufficiently large q, any N 2q + 1, and any d 1, there exists a universal constant c such that if mp cq, then no f T 1,1 d,m,d ,p exists that 1 2q-approximates q SA. Proof. We first embed every instance of DISJ with n = q into an instance of q SA and prove that they correspond. We assume the existence of the a transformer f T 1,1 d,m,d ,p that 1 2q-approximates q SA and implies the existence of an O(mp)-bit communication protocol that computes DISJ. An application of Fact 5 concludes the proof. 
Consider an instance of DISJ with a {0, 1}q and b {0, 1}q known by Alice and Bob respectively. We design an instance X = (zi; yi; i)i [N] of q SA. For each j [2q], let y2q+1 = {2i + ai 1 : i [q]}. Additionally, let zj = e1 if j is odd and b(j 1)/2 = 1, e1 otherwise. All other inputs are set arbitrarily. Then, q SA(X)2q+1 = 1 j [2q] : j y2q+1, j is odd, and a(j 1)/2 = 1 e1 j [2q] : j y2q+1 and (j is even or a(j 1)/2 = 0) e1 = |{i [q] : aibi = 1}| |{i [q] : aibi = 0}| Hence, q SA(X)2q+1 = e1 if and only if DISJ(a, b) = 0. It remains to show that this implies the existence of an efficient communication protocol that computes DISJ(a, b). By the existence of f, there exist Q, K, V : Rd Rm and ψ : Rm Rd such that f(X)2q+1 = ψ PN i=1 exp (Q(x2q+1)TK(xi)) V (xi) PN i=1 exp (Q(x2q+1)TK(xi)) The protocol is as follows: 1. From a, Alice determines y2q+1 and then computes Q(x2q+1) Rm, which she sends to Bob. This transmission uses O(mp) bits. 2. Bob determines z1, . . . , z2q from b. Using those and the information from Alice, he computes f(X)2q+1. He returns 1 if and only if f(X)T 2q+1e1 1 + 1 The protocol computes DISJ(a, b) because f is a 1 2q-approximation of q SA. Because any such protocol requires sharing Ω(q) bits of information, we conclude that mp cq for some c. B.4 Optimality of Theorem 3 under restricted architectures While the near-optimality of the bounded-precision self-attention construction in Theorem 2 is assured by the communication complexity argument of Theorem 4, it is not immediately apparent whether Theorem 3 is similarly optimal among infinite-precision self-attention models. Theorem 16 proves that this is indeed the case for a restricted family of architectures that resembles cross-attention rather than self-attention. Theorem 16. For input x1, . . . , x N satisfying xi = (zi; yi; i), suppose ϕ(xi)TQ = w(yi, i), ϕ(xi)TK = u(i), and ϕ(xi)TV = zi. Then, for any q < N and m q(1 C log N q) for some universal C, there do not exist w : Rd [N] Rm and u : [N] Rm such that the resulting self-attention unit 1 2q-approximates q SA. The architectural assumptions of this statement are strong. For each element xi = (zi; yi; i), its value embedding must reproduce its target zi; its key embedding depends exclusively on the index i; and its query embedding only on the indices yi and i. Indeed this attention unit more closely resembles cross-attention rather than self-attention, in which the problem is formulated as two sequences ((z1, 1), . . . , (z N, N)) and (y1; 1), . . . , (y N; N) that are passed to the key and value inputs and the query inputs respectively. We leave open the problem of generalizing this result to include all infinite-precision cross-attention or self-attention architectures, but we note that the constructions in Theorems 2 and 3 can be implemented under such architectural assumptions. The proof relies on a geometric argument about how the convex hull of fixed key embeddings U = (u(1), . . . , u(N)) lacks neighborliness and hence cannot separate every size-q subsets of values embeddings z1, . . . , z N from the other values. Proof. It suffices to show that for any fixed key embedding U, there exists some yi and setting of z1, . . . , z N such that (softmax(w(X)U T)Z)i 1 where w(X) = (w(y1, 1), . . . , w(y N, N)) RN m and U = (u(1), . . . , u(N)) RN m. By Fact 17, for some y1 [N] q , there are no w and τ R satisfying w(y1, 1)Tui τ if and only if i y1. Hence, for any fixed w, there exists i1 y1 and i2 [N] \ y1 such that w(y1, 1)Tui2 > w(y1, 1)Tui1. 
B.4 Optimality of Theorem 3 under restricted architectures

While the near-optimality of the bounded-precision self-attention construction in Theorem 2 is assured by the communication complexity argument of Theorem 4, it is not immediately apparent whether Theorem 3 is similarly optimal among infinite-precision self-attention models. Theorem 16 proves that this is indeed the case for a restricted family of architectures that more closely resembles cross-attention than self-attention.

Theorem 16. For inputs $x_1, \ldots, x_N$ satisfying $x_i = (z_i; y_i; i)$, suppose $\phi(x_i)_Q^T = w(y_i, i)$, $\phi(x_i)_K^T = u(i)$, and $\phi(x_i)_V^T = z_i$. Then, for any $q < N$ and $m \le q\big(1 - C\frac{\log N}{q}\big)$ for some universal constant $C$, there do not exist $w : \mathbb{R}^d \times [N] \to \mathbb{R}^m$ and $u : [N] \to \mathbb{R}^m$ such that the resulting self-attention unit $\frac{1}{2q}$-approximates $q\mathrm{SA}$.

The architectural assumptions of this statement are strong. For each element $x_i = (z_i; y_i; i)$, its value embedding must reproduce its target $z_i$; its key embedding depends exclusively on the index $i$; and its query embedding depends only on the indices $y_i$ and $i$. Indeed, this attention unit more closely resembles cross-attention than self-attention: the problem can be formulated as two sequences $((z_1, 1), \ldots, (z_N, N))$ and $((y_1; 1), \ldots, (y_N; N))$ that are passed to the key and value inputs and to the query inputs respectively. We leave open the problem of generalizing this result to all infinite-precision cross-attention or self-attention architectures, but we note that the constructions in Theorems 2 and 3 can be implemented under such architectural assumptions.

The proof relies on a geometric argument: the convex hull of the fixed key embeddings $u(1), \ldots, u(N)$ lacks neighborliness, and hence no query can separate every size-$q$ subset of the value embeddings $z_1, \ldots, z_N$ from the remaining values.

Proof. It suffices to show that for any fixed key embedding $U$, there exist some $y_1$ and a setting of $z_1, \ldots, z_N$ such that $\|(\mathrm{softmax}(w(X) U^T) Z)_1 - q\mathrm{SA}(X)_1\|_2 > \frac{1}{2q}$, where $w(X) = (w(y_1, 1), \ldots, w(y_N, N)) \in \mathbb{R}^{N \times m}$ and $U = (u(1), \ldots, u(N)) \in \mathbb{R}^{N \times m}$. By Fact 17, for some $y_1 \in \binom{[N]}{q}$, there are no $w$ and $\tau \in \mathbb{R}$ satisfying $w(y_1, 1)^T u_i \ge \tau$ if and only if $i \in y_1$. Hence, for any fixed $w$, there exist $i_1 \in y_1$ and $i_2 \in [N] \setminus y_1$ such that $w(y_1, 1)^T u_{i_2} > w(y_1, 1)^T u_{i_1}$.

Given the value embeddings $z_{i_1} = e_1$, $z_{i_2} = e_2$, and $z_i = e_3$ for all $i \notin \{i_1, i_2\}$, we have
$$\big\|(\mathrm{softmax}(w(X)U^T)Z)_1 - q\mathrm{SA}(X)_1\big\|_2^2 \ \ge\ \Big(\mathrm{softmax}(w(X)U^T)_{1,i_1} - \tfrac{1}{q}\Big)^2 + \mathrm{softmax}(w(X)U^T)_{1,i_2}^2 \ \ge\ \Big(\mathrm{softmax}(w(X)U^T)_{1,i_1} - \tfrac{1}{q}\Big)^2 + \mathrm{softmax}(w(X)U^T)_{1,i_1}^2 \ \ge\ \frac{1}{2q^2},$$
so the approximation error at index 1 exceeds $\frac{1}{2q}$.

Fact 17. If $m' < q\big(1 - \frac{\log N}{Cq}\big)$, then the columns of any $U = (u_1, \ldots, u_N) \in \mathbb{R}^{N \times m'}$ can be partitioned into sets $U_1$ and $U_2$ with $|U_1| = q$ that are not linearly separable. Hence, $\mathrm{Conv}(u_1, \ldots, u_N)$ is not $q$-neighborly.

Proof. By the Sauer–Shelah Lemma [Sauer, 1972, Shelah, 1972, Vapnik and Chervonenkis, 1968] and the fact that the VC dimension of $m'$-dimensional linear threshold functions is $m' + 1$, the number of partitions of the columns of $U$ that can be linearly separated is at most $C' N^{m'+1}$, which is smaller than $(N/q)^q$ for a sufficiently large choice of $C$ given the universal constant $C'$. If the fact were false, then at least $\binom{N}{q} \ge (N/q)^q$ such partitions would be linearly separable, which contradicts the above bound.

C Supplementary results for Section 4

C.1 Proof of Theorem 6

Theorem 6. For any input size $N$, input range $M = N^{O(1)}$, and fixed-precision bit complexity $p = O(\log M)$, there exists a transformer architecture $f \in \mathcal{T}^{1,1}_{1,m,1,p}$ with a single self-attention unit with embedding dimension $m = 3$ such that for all $X \in [M]^N$, $f(X) = \mathrm{Match2}(X)$.

Proof. As discussed in Section 2.1, we allow a single blank token to be appended to the end of the sequence $X$ and assume the existence of a positional encoding. That is, we consider the input $X = (x_1, \ldots, x_N, x_\perp)$ with $x_{i,0} = i$ and $x_\perp = 0$ to be the input to the target attention model. We define the input MLP $\phi : \mathbb{R} \to \mathbb{R}^3$ and parameterizations $Q, K, V \in \mathbb{R}^{3\times3}$ such that
$$Q^T\phi(x_i) = c\Big(\cos\tfrac{2\pi x_i}{M},\ -\sin\tfrac{2\pi x_i}{M},\ 1\Big), \qquad K^T\phi(x_i) = \Big(\cos\tfrac{2\pi x_i}{M},\ \sin\tfrac{2\pi x_i}{M},\ 0\Big), \qquad V^T\phi(x_i) = 1,$$
along with $Q^T\phi(x_\perp) = 0$, $K^T\phi(x_\perp) = e_3$, and $V^T\phi(x_\perp) = 0$. By elementary trigonometric identities, the corresponding inner products satisfy
$$(Q^T\phi(x_i))^T K^T\phi(x_j) = c\cos\tfrac{2\pi(x_i + x_j)}{M}, \qquad (Q^T\phi(x_i))^T K^T\phi(x_\perp) = c.$$
As a result, $(Q^T\phi(x_i))^T K^T\phi(x_j) = c$ if and only if $x_i + x_j \equiv 0 \pmod M$; otherwise, $(Q^T\phi(x_i))^T K^T\phi(x_j) \le c\big(1 - \tfrac{1}{M^2}\big)$. (Here, the $O(\log M)$-bit fixed-precision arithmetic is sufficient to numerically distinguish the two cases.) For each $i \in [N]$, let $\beta_i = |\{j \in [N] : x_i + x_j \equiv 0 \ (\mathrm{mod}\ M)\}|$ denote the total number of matches the $i$th input belongs to. If we take $c = M^2 \log(6N)$, then
$$\mathrm{softmax}\big(\phi(X) Q K^T \phi(X)^T\big)_{i,j} \in \begin{cases} [0, \tfrac{1}{6N}] & \text{if } x_i + x_j \not\equiv 0 \ (\mathrm{mod}\ M) \text{ and } i, j \in [N]; \\ [\tfrac{1}{\beta_i+1} - \tfrac{1}{6N},\ \tfrac{1}{\beta_i+1}] & \text{if } x_i + x_j \equiv 0 \ (\mathrm{mod}\ M) \text{ and } i, j \in [N]; \\ [\tfrac{1}{\beta_i+1} - \tfrac{1}{6N},\ \tfrac{1}{\beta_i+1}] & \text{if } i \in [N],\ j = N+1. \end{cases}$$
We conclude that for any $i \in [N]$,
$$\big(\mathrm{softmax}(\phi(X) Q K^T \phi(X)^T)\, V \phi(X)\big)_i \begin{cases} \ge \tfrac{1}{3} & \text{if } \exists j \text{ s.t. } x_i + x_j \equiv 0 \ (\mathrm{mod}\ M), \\ \le \tfrac{1}{6} & \text{if } \nexists j \text{ s.t. } x_i + x_j \equiv 0 \ (\mathrm{mod}\ M), \end{cases}$$
where the comparisons hold coordinate-wise over $i$. Since the former case occurs precisely when $\beta_i \ge 1$, the final step of the proof is to design an output MLP $\psi$ such that $\psi(z) = 1$ if $z \ge \tfrac13$ and $\psi(z) = 0$ if $z \le \tfrac16$, which can be crafted using two ReLU gates.
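The following is a small numerical sketch of the Match2 construction above (our illustration, not the paper's code; NumPy only). It hard-codes the cosine/sine query-key maps and the blank token, uses the scaling $c = M^2\log(6N)$ from the proof, and checks the thresholded attention output against a brute-force Match2.

```python
import numpy as np

def match2_attention(x, M, c=None):
    """Single-head attention flagging i such that x_i + x_j = 0 (mod M) for some j
    (a sketch of the Theorem 6 construction)."""
    N = len(x)
    c = M ** 2 * np.log(6 * N) if c is None else c
    ang = 2 * np.pi * np.asarray(x) / M
    Q = c * np.stack([np.cos(ang), -np.sin(ang), np.ones(N)], axis=1)   # queries for real tokens
    K = np.stack([np.cos(ang), np.sin(ang), np.zeros(N)], axis=1)       # keys for real tokens
    K = np.vstack([K, np.array([0.0, 0.0, 1.0])])                       # blank token key e_3
    V = np.concatenate([np.ones(N), [0.0]])                             # blank token value 0
    S = Q @ K.T                                                         # matches and the blank both score c
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return (A @ V >= 1 / 3).astype(int)                                 # output MLP threshold

rng = np.random.default_rng(0)
N, M = 20, 97
x = rng.integers(1, M, N)
brute = np.array([int(any((xi + xj) % M == 0 for xj in x)) for xi in x])
assert np.array_equal(match2_attention(x, M), brute)
print("attention construction matches brute-force Match2")
```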
C.2 Proof of Theorem 7

Theorem 7. There is a universal constant $c > 0$ such that for sufficiently large $N$ and any $M \ge N + 1$, if $mpH \le cN/\log\log N$, then there is no $f \in \mathcal{T}^{1,H}_{1,m,1,p}$ satisfying $f(X) = \mathrm{Match3}(X)$ for all $X \in [M]^N$.

Proof. The proof relies on a reduction to Fact 5 that embeds inputs to the set-disjointness problem of cardinality $n = \frac{N-1}{2}$ into a subset of instances passed to Match3. For the sake of simplicity, we assume in the construction that $N$ is odd; if it were not, we could replace it with $N - 1$ and set the final element so that it never belongs to a triple. We consider the following family of inputs to Match3:
$$x_i \in \begin{cases} \{0\} & \text{if } i = 1, \\ \{1,\ i\} & \text{if } i \in \{2, \ldots, \tfrac{N+1}{2}\}, \\ \{1,\ M - i + \tfrac{N-1}{2}\} & \text{if } i \in \{\tfrac{N+3}{2}, \ldots, N\}. \end{cases} \qquad (3)$$

Note that $\mathrm{Match3}(X)_1 = 1$ if and only if there exists $i \in \{2, \ldots, \tfrac{N+1}{2}\}$ such that $x_i = i$ and $x_{i + \frac{N-1}{2}} = M - i$. Given input $(a, b) \in \{0,1\}^n \times \{0,1\}^n$ to DISJ, let $x_{i+1} = 1$ if and only if $a_i = 0$, and let $x_{i + \frac{N+1}{2}} = 1$ if and only if $b_i = 0$. Then $\mathrm{Match3}(X)_1 = 1$ iff $\mathrm{DISJ}(a, b) = 1$.

Suppose $f(X) = \mathrm{Match3}(X)$ for all $X \in [M]^N$ for some $f \in \mathcal{T}^{1,H}_{1,m,1,p}$. We show that $f$ simulates an $O(mpH\log\log N)$-bit communication protocol for deciding DISJ. By the definition of the standard self-attention unit with multi-layer perceptrons, note that $f(X)_1 = \psi\big(\sum_{h=1}^H f_h(\phi(X))\big)$ for $\phi : \mathbb{R} \to \mathbb{R}^m$, $\psi : \mathbb{R}^m \to \{0,1\}$, and
$$f_h(X) = \frac{\sum_{i=1}^N \exp\!\big(Q_h(x_1)^T K_h(x_i)\big) V_h(x_i)}{\sum_{i=1}^N \exp\!\big(Q_h(x_1)^T K_h(x_i)\big)},$$
for $Q_h, K_h, V_h : \mathbb{R}^m \to \mathbb{R}^m$. If we assume that this construction exists and is known explicitly by both Alice and Bob, we can design a communication protocol for Alice and Bob to solve DISJ by sharing $O(mpH\log\log N)$ bits with one another. Let Alice possess $a \in \{0,1\}^n$ and Bob $b \in \{0,1\}^n$, with $n = \frac{N-1}{2}$.
1. Alice and Bob compute $(x_2, \ldots, x_{\frac{N+1}{2}})$ and $(x_{\frac{N+3}{2}}, \ldots, x_N)$ from $a$ and $b$ respectively.
2. Alice computes an $O(p\log\log N)$-bit approximation of the logarithm of the first half of the softmax normalization term for each attention head and sends the results to Bob. That is, she sends Bob
$$L_{h,a} = \log \sum_{i=1}^{\frac{N+1}{2}} \exp\!\big(Q_h(\phi(x_1))^T K_h(\phi(x_i))\big)$$
for each $h \in [H]$. This requires transmitting $O(pH\log\log N)$ bits.
3. Bob finishes the computation of the normalization terms
$$L_h = \log\!\Big(\exp(L_{h,a}) + \sum_{i=\frac{N+3}{2}}^{N} \exp\!\big(Q_h(\phi(x_1))^T K_h(\phi(x_i))\big)\Big)$$
for each $h$ and sends the results back to Alice (up to $O(p\log\log N)$ bits of precision). This again requires transmitting $O(pH\log\log N)$ bits.
4. Alice computes the partial convex combination of the first $\frac{N+1}{2}$ value vectors stipulated by the attention matrix,
$$S_{h,a} = \sum_{i=1}^{\frac{N+1}{2}} \frac{\exp\!\big(Q_h(\phi(x_1))^T K_h(\phi(x_i))\big)}{\exp(L_h)}\, V_h(\phi(x_i)),$$
for each $h$ and sends the partial combinations to Bob. This requires transmitting $O(mpH\log\log N)$ bits (using the same precision as above).
5. Bob finishes the computation of the convex combinations
$$f_h(X) = S_{h,a} + \sum_{i=\frac{N+3}{2}}^{N} \frac{\exp\!\big(Q_h(\phi(x_1))^T K_h(\phi(x_i))\big)}{\exp(L_h)}\, V_h(\phi(x_i)) \in \mathbb{R}^m.$$
Bob concludes the protocol by computing and outputting $f(X)_1$, using his knowledge of each $f_h(X)$ and of $\psi$.
By the equivalences previously established, Bob returns 1 if and only if $\mathrm{DISJ}(a, b) = 1$. Because the protocol requires $O(mpH\log\log N)$ bits of communication, we avoid contradicting Fact 5 only if $mpH = \Omega(n/\log\log N) = \Omega(N/\log\log N)$.
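The split-softmax bookkeeping in steps 2–5 is mechanical but easy to get wrong; the following sketch (ours, with random logits and values standing in for the unknown heads $f_h$) verifies that the two-party computation reproduces the centralized attention output while Alice only ever transmits one scalar and one $m$-vector per head.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, H = 21, 4, 2
half = (N + 1) // 2

scores = rng.standard_normal((H, N))       # stand-ins for Q_h(phi(x_1))^T K_h(phi(x_i))
values = rng.standard_normal((H, N, m))    # stand-ins for V_h(phi(x_i))

f_central = np.stack([(np.exp(scores[h]) @ values[h]) / np.exp(scores[h]).sum() for h in range(H)])

f_two_party = np.empty_like(f_central)
for h in range(H):
    # Step 2: Alice -> Bob, the log of her half of the normalizer (one scalar per head).
    L_a = np.log(np.exp(scores[h, :half]).sum())
    # Step 3: Bob -> Alice, the full log-normalizer.
    L = np.log(np.exp(L_a) + np.exp(scores[h, half:]).sum())
    # Step 4: Alice -> Bob, her partial convex combination (an m-vector per head).
    S_a = (np.exp(scores[h, :half]) / np.exp(L)) @ values[h, :half]
    # Step 5: Bob finishes the head locally.
    f_two_party[h] = S_a + (np.exp(scores[h, half:]) / np.exp(L)) @ values[h, half:]

assert np.allclose(f_central, f_two_party)
print("two-party split of the softmax matches the centralized attention head")
```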
Remark 1. The domain restrictions on Match3 stipulated in Equation (3) make the problem substantially easier to solve than the full-domain case. Indeed, under the domain restrictions,
$$\mathrm{Match3}(X)_1 = \max_{i \in \{2, \ldots, \frac{N+1}{2}\}} \mathrm{Match2}(X)_i,$$
which is computable by a two-layer single-headed transformer network with constant embedding dimension: the first layer computes each $\mathrm{Match2}(X)_i$ with the construction in the proof of Theorem 6, and the second computes the maximum of the previous outputs by using those outputs as key vectors. While Informal Conjecture 1 suggests that two layers are insufficient to compute the full-domain version of Match3, this restricted variant introduces a concise depth separation (see Eldan and Shamir [2016], Telgarsky [2016], Daniely [2017]) between one- and two-layer transformer models.

C.3 Higher-order tensor attention

We introduce a novel category of higher-order tensor-based transformer models in order to show that problems like Match3, which are hard to compute with standard transformer models, can be made efficiently solvable.

An $s$-order transformer is designed to efficiently compute dense $s$-wise interactions among input elements, in a manner analogous to how standard transformers compute pairwise interactions. (We think of a standard transformer as second-order.) Before defining the new type of attention, we introduce notation for the needed tensor products. For vectors $v^1 \in \mathbb{R}^{N_1}$ and $v^2 \in \mathbb{R}^{N_2}$, let $v^1 \otimes v^2 \in \mathbb{R}^{N_1 N_2}$ denote their Kronecker product, with $(v^1 \otimes v^2)_{(i_1-1)N_2 + i_2} = v^1_{i_1} v^2_{i_2}$. The column-wise Kronecker product of matrices $A^1 \in \mathbb{R}^{N_1 \times m}$ and $A^2 \in \mathbb{R}^{N_2 \times m}$ is
$$A^1 \bullet A^2 = [A^1_1 \mid \cdots \mid A^1_m] \bullet [A^2_1 \mid \cdots \mid A^2_m] = [A^1_1 \otimes A^2_1 \mid \cdots \mid A^1_m \otimes A^2_m] \in \mathbb{R}^{N_1 N_2 \times m}.$$
The following generalizes the definition of self-attention.

Definition 7. For order $s \ge 2$, input dimension $d$, output dimension $d'$, embedding dimension $m$, bit complexity $p$, and matrices $Q, K^1, \ldots, K^{s-1} \in \mathbb{R}^{d \times m}$ and $V^1, \ldots, V^{s-1} \in \mathbb{R}^{d \times d'}$ (encoded with $p$-bit fixed-point numbers), an $s$-order self-attention unit is a function $f_{Q,K,V} : \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d'}$ with
$$f_{Q,K,V}(X) = \mathrm{softmax}\big(\underbrace{XQ}_{\mathbb{R}^{N\times m}}\ \underbrace{\big((XK^1) \bullet \cdots \bullet (XK^{s-1})\big)^T}_{\mathbb{R}^{m \times N^{s-1}}}\big)\ \underbrace{\big((XV^1) \bullet \cdots \bullet (XV^{s-1})\big)}_{\mathbb{R}^{N^{s-1} \times d'}}.$$
The input to the row-wise softmax is an $N \times N^{s-1}$ matrix. Let $\mathcal{A}^{s}_{d,m,d',p}$ denote the set containing all such attention units. Note that $\mathcal{A}^{2}_{d,m,d',p} = \mathcal{A}_{d,m,d',p}$. Because $s$-order self-attention units have the same domain and codomain as standard self-attention, multiple units can be analogously combined to construct multi-headed attention units and full transformer models. We define $\mathcal{A}^{M,s}_{d,m,d',p}$ and $\mathcal{T}^{D,H,s}_{d,m,d',p}$ accordingly.

The purpose of the $s$-order transformer model as a theoretical construct is to examine how strictly generalizing the architecture to permit higher-order outer products transfers the expressive power of standard transformer architectures to more sophisticated interactions among elements of the input sequence $X$. The model is not defined to be immediately practical, due to its steep computational cost of evaluation. However, the trade-offs involved in using such architectures resemble those already made by using transformer models instead of fully-connected networks. Transformers are already computationally wasteful relative to their number of parameters, and these models likely succeed only because extremely efficient factorized parameterizations exist. Likewise, third-order transformers could indeed be practical if even more factorization proves useful, since the computational costs may prove mild whenever the embedding dimension $m$, number of heads $H$, and depth $D$ that a standard second-order transformer needs to succeed on a task exceed the sequence length $N$.
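As a concrete reading of Definition 7 for $s = 3$, the sketch below (ours; the function names `khatri_rao` and `third_order_attention` are our own) evaluates the flattened column-wise-Kronecker form and checks it against an explicit sum over index pairs $(j_1, j_2)$.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of (N1 x m) and (N2 x m) matrices -> (N1*N2 x m)."""
    N1, m = A.shape
    N2, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(N1 * N2, m)

def third_order_attention(X, Q, K1, K2, V1, V2):
    """s = 3 self-attention unit from Definition 7 (dense O(N^3) evaluation)."""
    S = (X @ Q) @ khatri_rao(X @ K1, X @ K2).T          # N x N^2 score matrix
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ khatri_rao(X @ V1, X @ V2)               # N x d' output

rng = np.random.default_rng(0)
N, d, m, dp = 5, 3, 4, 2
X = rng.standard_normal((N, d))
Q, K1, K2 = (rng.standard_normal((d, m)) for _ in range(3))
V1, V2 = (rng.standard_normal((d, dp)) for _ in range(2))

out = third_order_attention(X, Q, K1, K2, V1, V2)
q, k1, k2, v1, v2 = X @ Q, X @ K1, X @ K2, X @ V1, X @ V2
for i in range(N):
    w = np.array([[np.exp(q[i] @ (k1[j1] * k2[j2])) for j2 in range(N)] for j1 in range(N)])
    ref = sum(w[j1, j2] * (v1[j1] * v2[j2]) for j1 in range(N) for j2 in range(N)) / w.sum()
    assert np.allclose(out[i], ref)
print("Khatri-Rao formulation matches the explicit triple sum")
```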
C.4 Efficient representation of Match3 with third-order self-attention

Theorem 18 (Match3 construction with third-order self-attention). For any sequence length $N$, input range $M = N^{O(1)}$, and fixed-precision bit complexity $p = O(\log M)$, there exists a third-order transformer architecture $f \in \mathcal{T}^{1,1,3}_{1,m,1,p}$ with a single self-attention unit with embedding dimension $m = 5$ such that for all $X \in [M]^N$, $f(X) = \mathrm{Match3}(X)$.

Proof of Theorem 18. The proof is almost identical to that of Theorem 6, except that we use different key and query transforms to express a different trigonometric function:
$$Q^T\phi(x_i) = c\Big(\cos\tfrac{2\pi x_i}{M},\ -\cos\tfrac{2\pi x_i}{M},\ -\sin\tfrac{2\pi x_i}{M},\ -\sin\tfrac{2\pi x_i}{M},\ 1\Big),$$
$$(K^1)^T\phi(x_i) = \Big(\cos\tfrac{2\pi x_i}{M},\ \sin\tfrac{2\pi x_i}{M},\ \sin\tfrac{2\pi x_i}{M},\ \cos\tfrac{2\pi x_i}{M},\ 0\Big), \qquad (K^2)^T\phi(x_i) = \Big(\cos\tfrac{2\pi x_i}{M},\ \sin\tfrac{2\pi x_i}{M},\ \cos\tfrac{2\pi x_i}{M},\ \sin\tfrac{2\pi x_i}{M},\ 0\Big).$$
Together, these ensure that the resulting tensor products reduce (by the angle-addition formula) to a trigonometric expression that is maximized exactly when $x_i + x_{j_1} + x_{j_2} \equiv 0 \pmod M$. That is,
$$\big(\phi(X)Q\,((\phi(X)K^1) \bullet (\phi(X)K^2))^T\big)_{i,\ (j_1-1)N + j_2} = c\cos\tfrac{2\pi(x_i + x_{j_1} + x_{j_2})}{M}.$$
We similarly let $V^1\phi(x_i) = V^2\phi(x_i) = 1$ and $V^1\phi(x_\perp) = V^2\phi(x_\perp) = 0$, with the blank token's key embeddings supported on the fifth coordinate as in Theorem 6. The remaining choice of $c$, the output MLP, and the analysis of the softmax proceed identically to the previous proof.

C.5 Heuristic argument for Informal Conjecture 1

Conjecture 19 (Formal version of Informal Conjecture 1). For sufficiently large $N$ and any $d \ge 1$, for all $M \ge N + 1$ and $mpHD = O(\sqrt{N})$, there is no $f \in \mathcal{T}^{D,H}_{1,m,1,p}$ satisfying $f(X) = \mathrm{Match3}(X)$ for all $X \in [M]^N$.

We believe that the conjecture holds due to a heuristic information-theoretic argument. Define the distribution $\mathcal{D}$ over inputs $X \in \mathbb{R}^N$ that will be used to argue that a small model cannot compute Match3 for $M = N^4$ with high probability. We draw $X$ from $\mathcal{D}$ as follows:
(E1) With probability $\tfrac12$, draw each $x_i$ iid from $\mathrm{Unif}([M])$.
(E2) With probability $\tfrac12$, draw $(j_1, j_2, j_3)$ uniformly from $\binom{[N]}{3}$. For all $i \ne j_3$, draw each $x_i$ iid from $\mathrm{Unif}([M])$. Let $x_{j_3} = -x_{j_1} - x_{j_2} \pmod M$.

Note that under event E1, three matching elements exist with probability at most $\binom{N}{3}\tfrac{1}{M} \le \tfrac{1}{6N}$, so $\Pr[\mathrm{Match3}(X) = 0 \mid \mathrm{E1}] \ge 1 - \tfrac{1}{6N}$. Under event E2, a triple of matching elements is always planted, so $\mathrm{Match3}(X) \ne 0$. It would therefore suffice to show that, unless the transformer is sufficiently large, it cannot determine whether $\mathrm{Match3}(X) = 0$ with probability at least $0.9$. Under $\mathcal{D}$, any subset of $\{x_1, \ldots, x_N\}$ consists of iid integers drawn uniformly from $[M]$, unless all of $x_{j_1}, x_{j_2}, x_{j_3}$ appear in the subset.

Consider a transformer architecture with $p$-bit precision, $m$-dimensional embeddings, $H$ heads per layer, and $D$ layers. We argue informally that a single-element output of a self-attention unit can take into account information about $mp$ more of the inputs $x_1, \ldots, x_N$ than it had in the previous layer. By induction, after $D$ layers of $H$-headed self-attention with interleaved MLPs, each element is a function of at most $mpHD$ inputs. Until an element exists that is a function of at least two of the three of $x_{j_1}, x_{j_2}, x_{j_3}$, we assume that the elements known by each output are chosen independently of the indices $j_1, j_2, j_3$. (Given two elements of the triple, the third can be identified with a single self-attention unit.) Hence, we argue that it suffices to show that the probability of any two elements of the triple $j_1, j_2, j_3$ occurring within any of the $N$ sets of $mpHD$ inputs is vanishingly small in this parameter regime. The probability of a single collection containing at least two of the three planted inputs is at most $\binom{3}{2}\big(\tfrac{e\,mpHD}{N}\big)^2$. Thus, the probability that any collection contains two of the three planted inputs is no more than $3(e\,mpHD)^2/N$. If $mpHD = O(\sqrt{N})$, then the randomly chosen triple will not jointly appear in the knowledge set of any single element of a self-attention unit with probability at least $0.9$, and the transformer should not be expected to successfully distinguish between the two cases.
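The probability bound above is easy to sanity-check. The following sketch (ours) computes, exactly via a hypergeometric calculation and under the heuristic's independence assumption, how often at least one of $N$ random knowledge sets of size $k = mpHD$ contains two members of a planted triple; the transition near $k \approx \sqrt{N}$ is visible directly.

```python
from math import comb

def prob_two_of_triple_known(N, k):
    """P(a uniformly random size-k subset of [N] contains at least 2 of a fixed triple)."""
    total = comb(N, k)
    p2 = comb(3, 2) * comb(N - 3, k - 2) / total
    p3 = comb(N - 3, k - 3) / total if k >= 3 else 0.0
    return p2 + p3

# Heuristic regime: N output elements, each "knowing" k = mpHD inputs chosen independently
# of the planted triple (independence across elements is the heuristic's assumption).
N = 2000
for k in [5, 15, 45, 90, 180]:
    p_single = prob_two_of_triple_known(N, k)
    p_any = 1 - (1 - p_single) ** N
    print(f"k={k:4d}   P(some element knows 2 of the triple) ~ {p_any:.3f}   "
          f"crude union bound 3k^2/N = {3 * k * k / N:.3f}")
```

The crude bound $3k^2/N$ tracks the paper's $3(e\,mpHD)^2/N$ up to constants; both become vacuous once $k$ exceeds roughly $\sqrt{N}$.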
Should the conjecture hold, it would represent a tight lower bound on the size of the smallest standard transformer architecture necessary to compute Match3.

Theorem 20 (Tightness of Conjecture 19). For any sequence length $N$, if the input range satisfies $M = N^{O(1)}$ and the transformer size parameters satisfy $p \ge \log(M)$, $H = 1$, $m \ge 4$, and $mD \ge CN^2$ for some universal constant $C$, then there exists a transformer architecture $f \in \mathcal{T}^{D,H}_{1,m,1,p}$ such that $f(X) = \mathrm{Match3}(X)$.

Proof. We construct an architecture that collects a group of candidate pairs in each layer of single-headed self-attention and verifies whether there exists a triple incorporating each pair that satisfies the summation property. Then all candidate pairs are discarded, and the subsequent layer collects a new family of candidates.

To do so, we first let $\ell := \lfloor m/2\rfloor - 1$ denote the number of pairs shared in each layer of attention. We let $P = \binom{[N]}{2}$ denote the collection of all pairs of indices and partition it into $D$ subsets $P_1, \ldots, P_D$, each containing at most $\ell$ distinct pairs. (Since $|P| = \frac{N(N+1)}{2}$, any $D$ satisfying the theorem's preconditions is sufficiently large for this to be a proper partition.) Our construction ensures that if there exists a match $x_i + x_{j_1} + x_{j_2} \equiv 0 \pmod M$ with $(j_1, j_2) \in P_k$, then the $k$th layer of self-attention will verify its existence and mark $x_i$ as belonging to the match. Throughout the network, we maintain that the first two dimensions of the embedding of the $i$th element correspond to $x_i \in [M]$ and a bit indicating whether a match containing $x_i$ has been found so far.

Consider the first layer of self-attention, and let $P_1 = \{(i_1, j_1), \ldots, (i_\ell, j_\ell)\}$. We set the input MLP $\phi_1 : \mathbb{R}^d \to \mathbb{R}^m$ and the respective matrices $Q^1, K^1 \in \mathbb{R}^{m \times m}$ such that $Q^1\phi_1(x_i) = c e_1$ and
$$K^1\phi_1(x_i) = \begin{cases} e_1 & \text{if } i \text{ appears in a pair of } P_1, \\ 0 & \text{otherwise}, \end{cases}$$
for sufficiently large $c$. We additionally let
$$V^1\phi_1(x_i) = \begin{cases} (2\ell+1)\,(x_i;\ 0;\ 0) & i \text{ not in } P_1, \\ (2\ell+1)\,(x_i;\ 0;\ x_i e_{2\iota-1}) & i = i_\iota, \\ (2\ell+1)\,(x_i;\ 0;\ x_i e_{2\iota}) & i = j_\iota. \end{cases}$$
By making use of a residual connection, we ensure that the $i$th outcome of the self-attention layer is $(x_i, 0, x_{i_1}, x_{j_1}, \ldots, x_{i_\ell}, x_{j_\ell})$. We then encode an MLP that computes
$$(x_i, 0, x_{i_1}, x_{j_1}, \ldots, x_{i_\ell}, x_{j_\ell}) \mapsto \Big(x_i;\ \mathbf{1}\big\{\exists\, \iota \in [\ell] \text{ s.t. } x_i + x_{i_\iota} + x_{j_\iota} \equiv 0 \ (\mathrm{mod}\ M)\big\};\ 0\Big).$$
We repeat this construction $D$ times, the only modifications being the replacement of $P_1$ by $P_k$ in layer $k$ and the fact that the second dimension of the embedding remains 1 once it has been set to that value. After $D$ layers, the final MLP outputs the value of the second dimension, which is 1 if and only if the respective $x_i$ belongs to a three-way match.

C.6 Sharper separations for embedded subgraph detection problems

In pursuit of proving separations analogous to the one between Theorem 18 and Conjecture 19, we draw on techniques for proving lower bounds for graph problems in the CONGEST model of distributed computation with restricted bandwidth [Peleg, 2000].⁵

⁵At a high level, the CONGEST model features $N$ players that communicate in synchronous rounds over a network (an undirected graph with $[N]$ as its vertices) to solve a computational problem [Peleg, 2000]. In each round, each player can send a message to each of its neighbors. The computation that each player performs with the messages received from its neighbors is unrestricted; the primary resources considered in CONGEST are the number of rounds of communication and the message sizes. Although CONGEST is often studied for solving computational problems on input graphs with vertices $[N]$, the input graph need not be the same as the communication network.

The problems we consider take as input the adjacency matrix $X \in \{0,1\}^{N \times N}$ of an $N$-vertex graph $G = (V, E)$ with $V = [N]$, so $x_{i,j} = \mathbf{1}\{(i,j) \in E\}$. We may regard each row of $X$ as a high-dimensional ($d = N$) embedding of the $i$th vertex containing information about which (outgoing) edges are incident to that vertex. We consider the following problems:
$$\mathrm{DirectedCycle3}(X) = \big(\mathbf{1}\{\exists\, j_1, j_2 \in [N] \text{ s.t. } x_{i,j_1} x_{j_1,j_2} x_{j_2,i} = 1\}\big)_{i \in [N]};$$
$$\mathrm{Cycle5}(X) = \big(\mathbf{1}\{\exists\, j_1, j_2, j_3, j_4 \in [N] \text{ s.t. } x_{i,j_1} x_{j_1,j_2} x_{j_2,j_3} x_{j_3,j_4} x_{j_4,i} = 1\}\big)_{i \in [N]},$$
with $\mathrm{dom}(\mathrm{Cycle5}) = \{X : X = X^T\}$. The former treats $X$ as a directed graph (where $X$ need not be symmetric) and asks whether each vertex belongs to a directed 3-cycle. The latter insists that $X$ be an undirected graph by enforcing symmetry and determines membership in (undirected) 5-cycles.
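For reference, here is a direct and intentionally naive implementation of the two sequence-to-sequence problems just defined (our sketch, useful for testing; in the undirected case we additionally require the five vertices to be distinct so that the closed walk is a genuine 5-cycle).

```python
import numpy as np

def directed_cycle3(X):
    """DirectedCycle3(X)_i = 1 iff vertex i lies on a directed 3-cycle (brute force, O(N^3))."""
    N = X.shape[0]
    return np.array([int(any(X[i, j1] and X[j1, j2] and X[j2, i]
                             for j1 in range(N) for j2 in range(N)))
                     for i in range(N)])

def cycle5(X):
    """Cycle5(X)_i = 1 iff vertex i lies on an undirected 5-cycle (X symmetric, brute force)."""
    N = X.shape[0]
    return np.array([int(any(X[i, a] and X[a, b] and X[b, c] and X[c, d] and X[d, i]
                             and len({i, a, b, c, d}) == 5
                             for a in range(N) for b in range(N)
                             for c in range(N) for d in range(N)))
                     for i in range(N)])

rng = np.random.default_rng(0)
A = (rng.random((8, 8)) < 0.3).astype(int)
np.fill_diagonal(A, 0)
U = np.triu(A, 1); U = U + U.T            # symmetrize for the undirected problem
print("DirectedCycle3:", directed_cycle3(A))
print("Cycle5:        ", cycle5(U))
```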
However, solving these problems with any transformer model of constant order trivially requires the product of the precision $p$, embedding dimension $m$, heads per layer $H$, and depth $D$ to grow polynomially with $N$, since each attention unit is limited to considering at most $pm$ bits of information from each input. Such a lower bound is not interesting for dense graphs, where every vertex may have $\Omega(N)$ incident edges; the bottleneck is not due to any feature of standard attention units (and would persist with higher-order attention). To circumvent this issue, we consider an augmented self-attention unit, which permits each element of the self-attention tensor to depend both on its respective inner product and on the presence of edges among the corresponding inputs.

Definition 8. For order $s \ge 2$, input dimension $d$, output dimension $d'$, embedding dimension $m$, bit complexity $p$, matrices $Q, K^1, \ldots, K^{s-1} \in \mathbb{R}^{d \times m}$ and $V^1, \ldots, V^{s-1} \in \mathbb{R}^{d \times d'}$ (encoded with $p$-bit fixed-point numbers), and a cell-wise attention tensor function $\kappa : \{0,1\}^{s(s-1)} \times \mathbb{R} \to \mathbb{R}$, an $s$-order graph self-attention unit is a function $f_{Q,K,V} : \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d'}$ with
$$f_{Q,K,V}(X) = \mathrm{softmax}\Big(\kappa\big(X,\ XQ\,((XK^1) \bullet \cdots \bullet (XK^{s-1}))^T\big)\Big)\,\big((XV^1) \bullet \cdots \bullet (XV^{s-1})\big).$$
For an attention tensor $A \in \mathbb{R}^{N^s}$, we abuse notation by writing $\kappa(X, A)$ as shorthand for the cell-wise application of the fixed function $\kappa$, incorporating information about all relevant edges:
$$\kappa(X, A)_{i_1, \ldots, i_s} = \kappa\big(x_{i_1,i_2},\ x_{i_1,i_3},\ \ldots,\ x_{i_s,i_{s-1}},\ x_{i_s,i_{s-2}},\ A_{i_1,\ldots,i_s}\big).$$
Let $\mathcal{A}^{G,s}_{d,m,d',p}$ and $\mathcal{T}^{G\,D,H,s}_{d,m,d',p}$ denote all such attention units and all such transformers respectively.

Now, we provide four results that exhibit separations between orders of graph self-attention.

Theorem 21 (Hardness of representing Cycle5 with a standard graph transformer). For sufficiently large $N$, any $f \in \mathcal{T}^{G\,D,H}_{N,m,1,p}$ satisfying $f(X) = \mathrm{Cycle5}(X)$ for all $X \in \{0,1\}^{N\times N}$ with $X = X^T$ requires $mpHD = \Omega(N/\log^2 N)$.

Theorem 22 (Efficient construction of Cycle5 with a fifth-order graph transformer). For sequence length $N$ and bit complexity $p = O(\log N)$, there exists a fifth-order graph transformer architecture $f \in \mathcal{T}^{G\,1,1,5}_{N,1,1,p}$ with a single graph self-attention unit such that for all $X \in \{0,1\}^{N\times N}$ with $X = X^T$, $f(X) = \mathrm{Cycle5}(X)$.

Theorem 23 (Hardness of representing DirectedCycle3 with a standard graph transformer). For sufficiently large $N$, any $f \in \mathcal{T}^{G\,D,H}_{N,m,1,p}$ satisfying $f(X) = \mathrm{DirectedCycle3}(X)$ for all $X \in \{0,1\}^{N\times N}$ requires $mpHD = \Omega(N/\log^2 N)$.

Theorem 24 (Efficient construction of DirectedCycle3 with a third-order graph transformer). For sequence length $N$ and bit complexity $p = O(\log N)$, there exists a third-order graph transformer architecture $f \in \mathcal{T}^{G\,1,1,3}_{N,1,1,p}$ with a single graph self-attention unit such that for all $X \in \{0,1\}^{N\times N}$, $f(X) = \mathrm{DirectedCycle3}(X)$.

The proofs of Theorems 22 and 24 are immediate from the construction. Because each cell of the self-attention tensor has explicit access to the existence of all relevant edges, $\kappa$ can be configured to ensure that a cell's value is large if and only if the requisite edges for the desired structure all exist. Taking a softmax with a blank element (as in Theorem 6) ensures that the outcome of the self-attention unit for a given element distinguishes whether or not it belongs to a 5-cycle or a directed 3-cycle. The output MLP then ensures that the proper output is returned.
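The following sketch illustrates the idea behind Theorem 24 (our illustration, not the paper's exact parameterization): we take the base attention scores to be zero, let a hypothetical $\kappa$ add a large bonus $c$ whenever the three directed edges of a triangle are present, and use a blank column with an intermediate score in place of the blank token from Theorem 6.

```python
import numpy as np

def directed_cycle3_graph_attention(X, c=60.0):
    """Third-order graph self-attention sketch: kappa adds +c to cell (i, j1, j2) iff the
    directed edges i->j1, j1->j2, j2->i all exist; a blank cell scores c/2 with value 0."""
    N = X.shape[0]
    scores = np.zeros((N, N * N + 1))
    values = np.ones(N * N + 1)
    values[-1] = 0.0                                    # blank value
    scores[:, -1] = c / 2                               # blank score sits between "hit" and "miss"
    for i in range(N):
        for j1 in range(N):
            for j2 in range(N):
                if X[i, j1] and X[j1, j2] and X[j2, i]:
                    scores[i, j1 * N + j2] = c          # kappa fires on a directed triangle
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return (A @ values >= 1 / 2).astype(int)            # output MLP threshold

rng = np.random.default_rng(1)
X = (rng.random((10, 10)) < 0.2).astype(int)
np.fill_diagonal(X, 0)
brute = np.array([int(any(X[i, a] and X[a, b] and X[b, i] for a in range(10) for b in range(10)))
                  for i in range(10)])
assert np.array_equal(directed_cycle3_graph_attention(X), brute)
print("graph-attention sketch matches brute-force DirectedCycle3")
```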
We prove Theorems 21 and 23 by introducing a particular CONGEST communication graph that can be used to simulate any model in $\mathcal{T}^{G\,D,H}_{d,m,d',p}$ (and hence also any model in $\mathcal{T}^{D,H}_{d,m,d',p}$) in $O(mHD\log N)$ rounds of communication. We then show, for each problem, that we can encode any instance of the set-disjointness communication problem as an instance of Cycle5 (or DirectedCycle3) and derive a contradiction from the communication graph.

C.6.1 A CONGEST communication graph that generalizes standard graph transformer computation

The key principle of our analysis is that the predominant limitation of a transformer model is its communication bandwidth, not its computational abilities. We model transformers as having element-wise multi-layer perceptron units with unbounded computational power (but bounded-precision inputs and outputs), and self-attention units, which compute linear combinations of inputs in a carefully regimented way that limits the ability of individual elements to share information with one another. Here, we introduce a specific CONGEST graph for each sequence length $N$ and show that every transformer has a communication protocol that simulates its computation on this graph.

For fixed $N$, we design an undirected CONGEST graph $G_N = (V^N, E^N)$ with $O(N^2)$ nodes, each having degree at most 3. (Note that this graph is not the same as the graph provided as input $X$ to a transformer; it is fixed across all transformers taking inputs of sequence length $N$.) Let $u_1, \ldots, u_N$ be nodes in $V^N$ corresponding to each input. For every pair $i, j \in [N]$, let $v_{i,j}$ be a node as well. For each $i \in [N]$, let $B_i = (V_i, E_i)$ be a balanced binary tree having root $u_i$ and leaves $v_{i,1}, \ldots, v_{i,N}, v_{1,i}, \ldots, v_{N,i}$. Hence, each $B_i$ has $O(N)$ vertices of degree at most 3 and depth $O(\log N)$. Let $V^N = V_1 \cup \cdots \cup V_N$ and $E^N = E_1 \cup \cdots \cup E_N$. Noting that $E_1, \ldots, E_N$ are disjoint and that $V_1, \ldots, V_N$ are disjoint except for the shared leaves $v_{i,j}$, we conclude that $G_N$ contains $O(N^2)$ vertices of degree at most 3 and has diameter $O(\log N)$. We visualize the graph $G_N$ with a highlighted tree $B_1$ in Figure 4.

Figure 4: The CONGEST graph $G_N$ visualized for $N = 6$, with root nodes $\{u_i\}_{i \in [N]}$ in blue, leaf nodes $\{v_{i,j}\}_{i,j \in [N]}$ in green, and the nodes $V_1$ of the binary tree $B_1$ shaded red with its edges $E_1$ colored red.
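The degree and diameter claims for $G_N$ can be checked directly. The sketch below (ours) assembles $G_N$ as an adjacency list, laying out each balanced binary tree in heap order and using our own labels ("t", i, k) for the unnamed internal tree nodes, then reports the vertex count, the maximum degree, and the eccentricity of $u_1$.

```python
from collections import deque

def build_congest_graph(N):
    """Assemble G_N: one balanced binary tree B_i per root u_i (heap layout), whose 2N leaf
    slots are the shared nodes v_{i,1..N} and v_{1..N,i}. Returns an adjacency dict."""
    adj = {}
    def add_edge(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for i in range(1, N + 1):
        leaves = [("v", i, j) for j in range(1, N + 1)] + [("v", j, i) for j in range(1, N + 1)]
        internal = [("u", i)] + [("t", i, k) for k in range(2, len(leaves))]
        nodes = internal + leaves                      # heap with 2*len(leaves) - 1 positions
        for k in range(2, len(nodes) + 1):             # connect heap position k to its parent k//2
            add_edge(nodes[k - 1], nodes[k // 2 - 1])
    return adj

def eccentricity(adj, src):
    dist, queue = {src: 0}, deque([src])
    while queue:
        a = queue.popleft()
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return max(dist.values())

N = 16
adj = build_congest_graph(N)
print("vertices:", len(adj))                                   # O(N^2)
print("max degree:", max(len(nb) for nb in adj.values()))      # at most 3
print("eccentricity of u_1:", eccentricity(adj, ("u", 1)))     # O(log N), consistent with the diameter bound
```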
Lemma 25. For any transformer $f \in \mathcal{T}^{G\,D,H}_{d,m,d',p}$ and any $X \in \mathbb{R}^{N \times d}$ represented with $p$-bit fixed-precision numbers, there exists a CONGEST communication protocol on the graph $G_N$ that shares $p$ bits of information between adjacent vertices per round and satisfies the following characteristics:
(i) Before any communication begins, each node $u_i$ is provided with $x_i$, and each node $v_{i,j}$ is provided with $x_{i,j}$ and $x_{j,i}$.
(ii) After $T = O(HD(m + \log N))$ rounds of communication, each node $u_i$ outputs $f(X)_i$.

Proof. It suffices to give a protocol that computes the outcome of a single-headed unit of graph self-attention with parameters $Q, K, V \in \mathbb{R}^{m \times m}$ and $\kappa : \{0,1\}^2 \times \mathbb{R} \to \mathbb{R}$ and transmits its $i$th output back to $u_i$ in $O(m + \log N)$ rounds of $p$-bit communication. The remainder of the argument involves computing the outcomes of all element-wise MLPs within the respective vertices $u_1, \ldots, u_N$ (since we assume each node has unbounded computational power in the CONGEST model) and repeating variants of the protocol $HD$ times, once for every individual self-attention unit. Because the protocol is designed for a particular transformer architecture $f$, we may assume that every node in the CONGEST graph knows every parameter of $f$.

We give the protocol in stages. We assume inductively that every input to the unit, $y_1, \ldots, y_N \in \mathbb{R}^m$, is known by its respective vertex $u_1, \ldots, u_N$.
1. Every vertex $u_i$ computes $Q^T y_i \in \mathbb{R}^m$ and propagates it to every vertex $v_{i,1}, \ldots, v_{i,N}$. This can be done in $O(m + \log N)$ rounds by transferring one $p$-bit fixed-precision number per round from each element of the binary tree $B_i$ to each of its children. Because the edge sets $E_1, \ldots, E_N$ are disjoint, this operation can be carried out in parallel across all $i$.
2. Each $u_i$ computes $K^T y_i, V^T y_i \in \mathbb{R}^m$ and propagates them to $v_{1,i}, \ldots, v_{N,i}$ in $O(m + \log N)$ rounds.
3. Each $v_{i,j}$, using its knowledge of $x_{i,j}$ and $x_{j,i}$, computes $\alpha_{i,j} := \exp\!\big(\kappa(x_{i,j}, x_{j,i}, y_i^T Q K^T y_j)\big)$. This takes zero rounds of communication.
4. Each $u_i$ computes $\sum_{j=1}^N \alpha_{i,j}$ by propagating each $\alpha_{i,j}$ from $v_{i,j}$ up $B_i$ to $u_i$, iteratively summing the terms passed up. This takes $O(\log N)$ rounds.
5. Similarly, each $u_i$ computes $\sum_{j=1}^N \alpha_{i,j} V^T y_j$ in $O(m + \log N)$ rounds. It then computes
$$\frac{\sum_{j=1}^N \alpha_{i,j} V^T y_j}{\sum_{j=1}^N \alpha_{i,j}},$$
which is the target output of the self-attention unit.
Because all steps are achievable in parallel within $O(m + \log N)$ rounds, the claim follows.

C.6.2 Reduction from set disjointness

Before proving Theorems 21 and 23 by embedding instances of DISJ into each subgraph identification problem, we first introduce a partition of the vertices $V^N$ of the CONGEST graph into those possessed by Alice and those possessed by Bob, for use in a two-party communication protocol. We call these two sets $V^N_a$ and $V^N_b$. Note that the previous section made no assumptions about the organization of edges within each binary tree. We thus add an additional condition: each binary tree $B_i$ is oriented so as to respect the left-to-right ordering of its leaves $v_{i,1}, v_{1,i}, \ldots, v_{i,N}, v_{N,i}$. Let $u_i \in V^N_a$ if and only if $i \le \frac{N}{2}$, and $v_{i,j} \in V^N_a$ if and only if $\min(i,j) \le \frac{N}{2}$. We label the remaining nodes of each $B_i$ by assigning a parent node $w_p$ as a function of its child nodes $w_\ell$ and $w_r$ using the following rules:
(a) If $w_\ell, w_r \in V^N_a$, then let $w_p \in V^N_a$.
(b) If $w_\ell, w_r \in V^N_b$, then let $w_p \in V^N_b$.
(c) Otherwise, let $w_p \in V^N_a$ if and only if the root $u_i \in V^N_a$.
This partition, which we visualize in Figure 5, bounds the number of bits Alice and Bob can exchange by simulating a protocol on the CONGEST graph $G_N$.

Figure 5: The CONGEST graph $G_N$ with vertices partitioned into the sets $V^N_a$ (violet) and $V^N_b$ (orange) for $N = 6$. The six edges cut by the partition are colored red.

Lemma 26. Suppose Alice and Bob simulate an $R$-round $p$-bit protocol on the CONGEST communication graph $G_N$, where Alice has access to all vertices in $V^N_a$ and Bob to all vertices in $V^N_b$. No communication is permitted other than the sharing of bits between neighboring vertices as allowed by the CONGEST protocol. Then Alice and Bob exchange at most $O(pRN\log N)$ bits.

Proof. It suffices to show that the partition $(V^N_a, V^N_b)$ induces a cut of size at most $O(N\log N)$; this ensures that the two parties can send no more than $O(pN\log N)$ bits per round. Per the rules defined above, an edge $(w_p, w_\ell)$ or $(w_p, w_r)$ is cut only if the parent is described by case (c). Within each tree $B_i$, under the orientation described above, an inductive argument shows that in every layer all elements of $V^N_a$ lie to the left of all elements of $V^N_b$. Thus, at most one parent in each layer belongs to case (c), and hence there is no more than one cut edge per layer. Because each tree has $O(\log N)$ layers and there are $N$ trees, the partition cuts at most $O(N\log N)$ edges.

It remains to embed an instance of DISJ into $(V^N_a, V^N_b)$ for each problem such that the problem's output corresponds exactly to that of DISJ.
Proof of Theorem 21. Assume for the sake of simplicity that $N$ is divisible by 5. Let $a, b \in \{0,1\}^n$ for $n = \frac{N^2}{25}$ be an input to DISJ, and let Alice and Bob possess $a$ and $b$ respectively. We index these vectors as $a = (a_{1,1}, a_{1,2}, \ldots, a_{N/5,N/5-1}, a_{N/5,N/5})$ and $b = (b_{1,1}, \ldots, b_{N/5,N/5})$ for ease of analysis. We design the input matrix $X \in \{0,1\}^{N\times N}$ as follows:
If $i \in (0, \tfrac{N}{5}]$ and $j \in (\tfrac{N}{5}, \tfrac{2N}{5}]$, then $x_{i,j} = x_{j,i} = a_{i,\, j - N/5}$.
If $i \in (\tfrac{N}{5}, \tfrac{2N}{5}]$ and $j \in (\tfrac{2N}{5}, \tfrac{3N}{5}]$, then $x_{i,j} = x_{j,i} = \delta_{i,\, j - N/5}$.
If $i \in (\tfrac{2N}{5}, \tfrac{3N}{5}]$ and $j \in (\tfrac{3N}{5}, \tfrac{4N}{5}]$, then $x_{i,j} = x_{j,i} = \delta_{i,\, j - N/5}$.
If $i \in (\tfrac{3N}{5}, \tfrac{4N}{5}]$ and $j \in (\tfrac{4N}{5}, N]$, then $x_{i,j} = x_{j,i} = b_{j - 4N/5,\, i - 3N/5}$.
If $i \in (\tfrac{4N}{5}, N]$ and $j \in (0, \tfrac{N}{5}]$, then $x_{i,j} = x_{j,i} = \delta_{i,\, j + 4N/5}$.
Otherwise, $x_{i,j} = 0$.
This ensures that $X$ has a 5-cycle if and only if there exist $i, j \in (0, \tfrac{N}{5}]$ such that $a_{i,j} b_{i,j} = 1$.⁶ In addition, note that under the protocol in Lemma 25, Alice's and Bob's inputs $a$ and $b$ are known exclusively by nodes belonging to $V^N_a$ and $V^N_b$ respectively.

⁶We consider 5-cycles rather than 4-cycles because a spurious 4-cycle could exist among the edges $\{x_{i,j} : i \in (0, \tfrac{N}{5}],\ j \in (\tfrac{N}{5}, \tfrac{2N}{5}]\}$ alone.

Consider any transformer architecture $f \in \mathcal{T}^{G\,D,H}_{N,m,1,p}$ that computes Cycle5. By Lemma 25, there exists a protocol on the CONGEST graph $G_N$ that computes Cycle5 after $O(HD(m + \log N))$ rounds of communication of $p$ bits each. If Alice and Bob simulate this protocol and output 1 if and only if at least one of their outputs indicates the existence of a 5-cycle, then they successfully decide DISJ. By Lemma 26, this communication algorithm solves DISJ after exchanging $O(mpHDN\log^2 N)$ bits of communication. However, Fact 5 implies that no communication algorithm can do so without exchanging $\Omega(n) = \Omega(N^2)$ bits, which concludes the proof.

Proof of Theorem 23. The proof is identical to its predecessor, but uses a different embedding of an instance $a, b \in \{0,1\}^n$ of DISJ. Let $n = \frac{N^2}{16}$, index $a$ and $b$ as before, and design $X \in \{0,1\}^{N \times N}$ as follows:
If $i \in (0, \tfrac{N}{4}]$ and $j \in (\tfrac{N}{2}, \tfrac{3N}{4}]$, then $x_{i,j} = a_{i,\, j - N/2}$.
If $i \in (\tfrac{N}{2}, \tfrac{3N}{4}]$ and $j \in (\tfrac{3N}{4}, N]$, then $x_{i,j} = b_{j - 3N/4,\, i - N/2}$.
If $i \in (\tfrac{3N}{4}, N]$ and $j \in (0, \tfrac{N}{4}]$, then $x_{i,j} = \delta_{i,\, j + 3N/4}$.
Otherwise, $x_{i,j} = 0$.
This construction ensures that a directed 3-cycle exists if and only if a corresponding pair of elements of $a$ and $b$ are both 1.
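The DirectedCycle3 embedding is quick to verify numerically. The sketch below (ours; the helper names are invented for illustration) builds $X$ from $(a, b)$ exactly as in the proof of Theorem 23 and confirms, on random instances, that a directed triangle exists iff the corresponding sets intersect.

```python
import numpy as np

def embed_disj_as_directed_cycle3(a, b):
    """Theorem 23 embedding: a, b are (N/4 x N/4) 0/1 matrices; returns the N x N adjacency X."""
    k = a.shape[0]
    N = 4 * k
    X = np.zeros((N, N), dtype=int)
    X[0:k, 2*k:3*k] = a                     # block (0, N/4] -> (N/2, 3N/4] carries Alice's bits
    X[2*k:3*k, 3*k:4*k] = b.T               # block (N/2, 3N/4] -> (3N/4, N] carries Bob's bits
    X[3*k:4*k, 0:k] = np.eye(k, dtype=int)  # fixed matching (3N/4, N] -> (0, N/4] closes the cycle
    return X

def has_directed_triangle(X):
    N = X.shape[0]
    return any(X[i, j] and X[j, l] and X[l, i]
               for i in range(N) for j in range(N) for l in range(N))

rng = np.random.default_rng(2)
for _ in range(50):
    k = 5
    a = (rng.random((k, k)) < 0.15).astype(int)
    b = (rng.random((k, k)) < 0.15).astype(int)
    intersect = bool((a & b).any())
    assert has_directed_triangle(embed_disj_as_directed_cycle3(a, b)) == intersect
print("DirectedCycle3 embedding agrees with set intersection on 50 random instances")
```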
D Experiment details

This section describes the experimental setup behind Figure 2 and provides further experiments suggesting an implicit bias of transformers toward $q\mathrm{SA}$, in particular when compared with MLPs and RNNs.

Experimental setup. Experiments use synthetic data generated for $q\mathrm{SA}$ with $n = 1000$ training and testing examples, a sequence length $N = 20$, and $q = 3$, with the individual inputs described in more detail as follows. The positional encoding of element $i$ is a random vector sampled uniformly from the sphere in $\mathbb{R}^{d_0}$ with $d_0 := \lceil 1 + 2\ln(N)\rceil$, a quantity which agrees with the theory but was not tuned. A sequence element then consists of the data portion $z \in \mathbb{R}^{d_1}$ with $d_1 = 4$, also sampled from the unit sphere, followed by the positional encoding of this sequence element, and then $q$ further positional encodings identifying the elements to average to produce the output; this differs from (and is more tractable than) the presentation in Section 3, where the positional encoding is provided as an integer and the MLP layer feeding our attention layers is expected to choose a sufficient positional encoding. As such, the total dimension of a sequence element is $d_1 + (q+1)d_0 = 32$.

The architectures are detailed as follows. The attention architecture is identical to the description in the paper body, with the additional detail that the width and embedding dimension $m$ are fixed to 100. Figure 6 also contains an MLP, which first flattens the input, then has a single hidden ReLU layer of width 256, followed by a final linear layer and an output reshaping to match the desired output sequence shape. Figure 6 also contains an LSTM, which is a standard pytorch LSTM with 2 layers and a hidden state of size 800, which is 200 times larger than the target output dimension 4.

Figure 6: Test and train error curves for fitting various architectures (attention, LSTM, and MLP; training and testing curves for each) to $q\mathrm{SA}$, where the horizontal axis denotes thousands of training iterations and the vertical axis denotes the regression objective; see Section D for further details.

Experiments fit the regression loss using Adam with a minibatch size of 32 and default precision; they take a few minutes to run on an NVIDIA TITAN XP and would be much faster on standard modern hardware.

Further discussion of Figure 2 and Figure 7. In Figure 2 and Figure 7, we plot (post-softmax) alignment matrices after $T \in \{0, 1000, 40000\}$ iterations of Adam. The alignment matrices in Figure 2 are taken from the training example whose loss is the median loss across all examples. Figure 7 is similar, but additionally shows the examples of minimal and maximal loss.

Further discussion of Figure 6. Figure 6 plots training and testing error curves for the same attention architecture as in Figure 2, together with the MLP (trained on flattened, vectorized inputs) and LSTM architectures described above; error bars reflect 5 separate training runs from random initialization. A few variations of these architectures were attempted, but the curves did not change qualitatively; in particular, only the attention layer achieves good generalization across all attempts.

Figure 7: Alignment plots as in Figure 2, but using the examples with minimum, median, and maximum loss (panels (a)–(i), shown at $T \in \{0, 1000, 40000\}$ iterations), whereas Figure 2 only uses the example with median loss.
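For completeness, the following is a sketch of the synthetic data generation described above (our reading of the setup; the function name, tensor layout, and use of NumPy rather than the original pytorch pipeline are our own choices).

```python
import numpy as np

def sample_qsa_batch(n, N=20, q=3, d1=4, rng=None):
    """Generate n qSA examples: each sequence element is (data part z, its own positional
    encoding, and the q positional encodings of the elements to be averaged)."""
    rng = np.random.default_rng(0) if rng is None else rng
    d0 = int(np.ceil(1 + 2 * np.log(N)))                  # 7 for N = 20
    def unit(*shape):
        v = rng.standard_normal(shape)
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    pos = unit(n, N, d0)                                   # positional encodings, one per element
    Z = unit(n, N, d1)                                     # data parts z_i on the unit sphere
    idx = np.stack([rng.permutation(N)[:q] for _ in range(n * N)]).reshape(n, N, q)
    X = np.concatenate([Z, pos,
                        pos[np.arange(n)[:, None, None], idx].reshape(n, N, q * d0)],
                       axis=-1)                            # total dimension d1 + (q+1)d0 = 32
    Y = Z[np.arange(n)[:, None, None], idx].mean(axis=2)   # targets: average of the q selected z's
    return X, Y

X, Y = sample_qsa_batch(8)
print(X.shape, Y.shape)   # (8, 20, 32) (8, 20, 4)
```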