# Transformers, Parallel Computation, and Logarithmic Depth

Clayton Sanford¹, Daniel Hsu¹, Matus Telgarsky²

¹Department of Computer Science, Columbia University, New York, NY, USA. ²Courant Institute, New York University, New York, NY, USA. Correspondence to: Clayton Sanford.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## Abstract

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation, a popular model of distributed computing with wide-ranging algorithmic results. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.

## 1 Introduction

The transformer (Vaswani et al., 2017) has emerged as the dominant neural architecture for many sequential modeling tasks such as machine translation (Radford et al., 2019) and protein folding (Jumper et al., 2021). Reasons for the success of transformers include suitability to modern hardware and training stability: unlike in recurrent models, inference and training can be efficiently parallelized, and training is less vulnerable to vanishing and exploding gradients. However, the advantages of transformers over other neural architectures can be understood more fundamentally via the lens of representation, which regards neural nets as parameterized functions and asks what they can efficiently compute.

Many previous theoretical studies of transformers establish (approximation-theoretic and computational) universality properties, but only at large model sizes (Yun et al., 2020; Pérez et al., 2021). These results are not unique to transformers and reveal little about which tasks can be solved in a size-efficient manner. Several other works (e.g., Hahn, 2020; Merrill & Sabharwal, 2022; Sanford et al., 2023) give fine-grained representational results in the scaling regime where the context length grows but the model depth is constant. In this regime, basic algorithmic tasks like matching parentheses and evaluating Boolean formulas are impossible.

In this work, we identify parallelism as a key to distinguishing transformers from other architectures. While recurrent architectures process their inputs serially, transformers allow independent interactions between the input tokens, mediated by the inner products between query and key embeddings in self-attention units. We leverage this property of self-attention to establish a formal connection between transformers and Massively Parallel Computation (MPC) (Karloff et al., 2010). Concretely, we design transformers that simulate MPC protocols (and vice versa), and in doing so, we exhibit a wide range of computational tasks that are solved by logarithmic-depth transformers, including tasks that cannot be efficiently solved with other architectures such as graph neural nets and recurrent models.

### 1.1 Our Results

We advance the understanding of transformers' representational capabilities with the following results. 1. The algorithmic capabilities and limitations of logarithmic-depth transformers are captured by the MPC model (Section 3). 2.
There is a simple sequential task that (i) is solved by (and, empirically, learned from data using) logarithmicdepth transformers, but (ii) cannot be efficiently solved by several alternative architectures (Sections 4 and 5). In more detail, our first collection of results, Theorems 3.1 and 3.4, show that any R-round MPC protocol can be implemented by a transformer of depth O(R), and that any depth L transformer can be simulated by an O(L)-round MPC protocol. The former implies that several graph problems are solved by logarithmic-depth transformers (Corollary 3.3); the latter implies the near-optimality of these transformers (Corollary 3.5) conditional on a well-known conjecture about the limitations of MPC algorithms (Conjecture 2.4). A key technical step (Lemma 3.2) shows how transformers can implement the simultaneous message-passing used in MPC protocols to communicate between machines. While previous works (Sanford et al., 2023) have used communication Transformers, Parallel Computation, and Logarithmic Depth complexity to understand the representational limitations of self-attention layers, our results show the benefits of the communication lens for understanding the strengths of transformers as well. Our second set of results concern the k-hop induction heads task, a synthetic sequential task that draws inspiration from the induction heads primitive of Elhage et al. (2021). The theoretical results of Section 4 prove that depth L = Θ(log k) is necessary and sufficient for efficient transformer representation. An accompanying empirical investigation reveals that transformers trained on the task obey the same threshold and recover a similar model to the theoretical construction. In contrast, Section 5 illustrates that non-parallelizable recurrent architectures including statespace models like Mamba (Gu & Dao, 2023) are unable to solve the task in a size-efficient manner. Moreover, wellknown transformer models with computationally-efficient alternatives to self-attention, like Performer (Choromanski et al., 2022) and Longformer (Beltagy et al., 2020), and shallow transformers with chain-of-thought prompting sacrifice their abilities to implement parallel algorithms, as evidenced by their proven inability to solve this task. 1.2 Related Work Some of the types of lower bounds we sought in this work were inspired by the literature on depth-separation for feed-forward neural networks (e.g., Eldan & Shamir, 2016; Daniely, 2017; Telgarsky, 2016), which exhibit functions that are efficiently approximated by deep networks, but not by shallower networks. Many theoretical approaches have been used to understand the representational capabilities of transformers and selfattention units in various scaling regimes. Some works model (variants of) transformers as machines for recognizing formal languages, such as the Dyck languages (Hahn, 2020; Bhattamishra et al., 2020; Yao et al., 2021; Hao et al., 2022) and star-free regular languages (Angluin et al., 2023). These approaches reveal inability of fixed-size transformers to handle arbitrarily long inputs. Other works show how transformers can simulate finite-state automata (Liu et al., 2022) with logarithmic depth, and Turing machines with (unrolled) depth (or chain-of-thought length) scaling polynomially with total runtime (Wei et al., 2021; Malach, 2023; Merrill & Sabharwal, 2023b). However, it is unclear if these results are near optimal or even transformer-specific. 
Theoretical results about the limitations of constant-depth transformers have been articulated by way of analogy to circuit complexity (Merrill & Sabharwal, 2023a; Merrill et al., 2022; Merrill & Sabharwal, 2022; Strobl, 2023; Strobl et al., 2023), implying the inability of constant-depth transformers to solve tasks like graph connectivity and Boolean formula evaluation. Other works characterize the representational capabilities of one-layer transformers (Likhosherstov et al., 2021; Sanford et al., 2023), but these approaches do not apply to deeper models. Sanford et al. study multi-headed attention using communication complexity, a framing that informs this work s connection to distributed computing. The MPC model (Karloff et al., 2010; Beame et al., 2017; Goodrich et al., 2011; Andoni et al., 2014; Im et al., 2023) was introduced to study distributed computing frameworks such as Map Reduce (Dean & Ghemawat, 2004). A major goal is to design protocols that use few rounds of communication for setups in which each machine s local memory is sublinear in the input size. Many advances have been made in MPC algorithms for important problems, including weighted interval selection, approximate maximum matching, and clustering (see, e.g., Im et al., 2023, for a recent survey). Nanongkai & Scquizzato (2022) established equivalence classes among MPC algorithmic tasks, proving that determining the connectivity of a graph is equivalent to numerous other graph reasoning tasks such as bipartiteness testing and cycle detection in O(1) rounds of MPC computation. The centrality of graph connectivity to the study of MPC is further evident in its conjectured hardness. Connectivity in sparse graphs is a basic problem that has resisted progress, and all all MPC protocols in this memory regime appear to require Ω(log n) rounds for input graphs on n vertices. Lower bounds in MPC and related models were studied by Beame et al. (2017), Roughgarden et al. (2018), and Charikar et al. (2020). The conjectured impossibility of o(log n)-round protocols for connectivity is now used as basis for conditional lower bounds (Ghaffari et al., 2019). Simulation of transformers by recurrent models (Oren et al., 2024) and simulation of graph neural nets (GNNs) by transformers (Kim et al., 2022) offer some coarse-grain insight into the relationship between these architectures, but separations are not implied by these previous works. Our connection between transformers and MPC is most similar to that established by Loukas (2019) between GNNs and the CONGEST model of distributed computation. Both works establish positive and negative results by identifying neural architectures with communication protocols. In Section 5.1, we show that the MPC connection allows transformers solve graph connectivity more efficiently than GNNs. Our k-hop induction heads task is designed as a k-fold composition of its standard analogue (Elhage et al., 2021). It is similar to a special case of the LEGO reasoning task (Zhang et al., 2023), which reveals the super-linear benefit of depth with respect to k; in our case, we theoretically and empirically exhibit an exponential benefit. We also draw a connection to the well-studied problem of pointerchasing (Papadimitriou & Sipser, 1982; Duris et al., 1984; Nisan & Wigderson, 1993), which enables the proof of our Transformers, Parallel Computation, and Logarithmic Depth separation between parallel and serial architectures. 
Our fine-grained empirical interpretability analysis for synthetic tasks draws inspiration from similar approaches for the analysis of sequential algorithms like sorting and reversal (Li & McClelland, 2022).

## 2 Preliminaries

### 2.1 Transformers

We first define a self-attention head, the core primitive of a transformer. The softmax operator is $\mathrm{softmax}(v) = (\exp(v_1), \ldots, \exp(v_N)) / \sum_{j=1}^{N} \exp(v_j)$ for $v \in \mathbb{R}^N$. We apply softmax to matrices $A \in \mathbb{R}^{N \times N}$ row-wise, i.e., $\mathrm{softmax}(A)_i = \mathrm{softmax}((A_{i,1}, \ldots, A_{i,N}))$.

**Definition 2.1** (Self-attention head). A self-attention head is a mapping $f_{Q,K,V} : \mathbb{R}^{N \times m} \to \mathbb{R}^{N \times m}$ defined by $f_{Q,K,V}(X) = \mathrm{softmax}(Q(X)K(X)^{\mathsf{T}})V(X)$ and parameterized by row-wise query, key, and value embeddings $Q, K, V : \mathbb{R}^{N \times m} \to \mathbb{R}^{N \times m}$ (e.g., $Q(X) = (Q_1(X_1), \ldots, Q_N(X_N))$). Let $\mathsf{Attn}^N_m$ denote the set of all self-attention heads with embedding dimension $m$ and context length $N$.

A transformer composes $L$ layers of $H$ self-attention heads per layer, plus an output multi-layer perceptron (MLP).

**Definition 2.2** (Transformer). A transformer is a mapping $T : \mathbb{R}^{N \times d_{\mathrm{in}}} \to \mathbb{R}^{N \times d_{\mathrm{out}}}$ specified by self-attention heads $(f_{\ell,h} \in \mathsf{Attn}^N_m)_{\ell \in [L], h \in [H]}$ and element-wise input and output MLPs $\phi = (\phi_1, \ldots, \phi_N) : \mathbb{R}^{N \times d_{\mathrm{in}}} \to \mathbb{R}^{N \times m}$ and $\psi = (\psi_1, \ldots, \psi_N) : \mathbb{R}^{N \times m} \to \mathbb{R}^{N \times d_{\mathrm{out}}}$. Upon input $X \in \mathbb{R}^{N \times d_{\mathrm{in}}}$, the transformer computes intermediate embeddings $X^0, \ldots, X^L \in \mathbb{R}^{N \times m}$ with $X^0 = \phi(X)$ and $X^\ell = X^{\ell-1} + \sum_{h=1}^{H} f_{\ell,h}(X^{\ell-1})$, and returns $T(X) = \psi(X^L)$ as output. Let $\mathsf{Transformer}^N_{m,L,H,d_{\mathrm{in}},d_{\mathrm{out}}}$ denote the set of all such transformers, and $\mathsf{Transformer}^N_{m,L,H} := \mathsf{Transformer}^N_{m,L,H,1,1}$.

**Modeling assumptions.** We treat the transformer as a computational model that permits arbitrary element-wise computation, but restricts the manner in which multiple elements are processed together. This manifests in our decision to model query/key/value embeddings and MLPs as arbitrary functions on the embedding space; Loukas (2019) employs a similar modeling assumption for GNNs. Note that the element-wise embeddings and MLPs may be index-specific, obviating the need for positional embeddings.

Our theoretical results cover the scaling regime where the context length $N$ is the main asymptotic parameter, while the embedding dimension $m$, the number of heads $H$, and the depth $L$ grow sub-linearly in $N$. This reflects real-world trends in large language models, where context lengths have sharply increased in recent years. Throughout, we assume all intermediate computations in transformers are represented by $p$-bit precision numbers for $p = \Theta(\log N)$. Limiting the precision is consistent with the recent practice of using low-precision arithmetic with transformers (e.g., Wang et al., 2022; Dettmers et al., 2022). We discuss this precision assumption in greater detail in Appendix A.1, along with other minor technical assumptions (such as the inclusion of a start token for mathematical convenience).

**Masked Transformers.** We also consider masked self-attention, where only certain inner products influence the softmax output. Let $\Lambda \in \{-\infty, 0\}^{N \times N}$ be a masking matrix with at least one zero entry in every row. Then, a $\Lambda$-masked self-attention unit is defined by $f^{\Lambda}_{Q,K,V}(X) = \mathrm{softmax}(Q(X)K(X)^{\mathsf{T}} + \Lambda)V(X)$. Let $\Lambda\text{-}\mathsf{Attn}^N_m$ and $\Lambda\text{-}\mathsf{Transformer}^N_{m,L,H}$, respectively, denote the sets of all $\Lambda$-masked self-attention heads and all transformers comprised of those heads. We define causally-masked transformers by $\mathsf{MaskAttn}^N_m := \Gamma\text{-}\mathsf{Attn}^N_m$ and $\mathsf{MaskTransformer}^N_{m,L,H} := \Gamma\text{-}\mathsf{Transformer}^N_{m,L,H}$, where $\Gamma$ is the lower-triangular mask with $\Gamma_{i,j} = 0$ iff $j \le i$.
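To make Definitions 2.1 and 2.2 concrete, here is a minimal NumPy sketch of a single self-attention head and of the residual layer update $X^\ell = X^{\ell-1} + \sum_{h=1}^H f_{\ell,h}(X^{\ell-1})$. The element-wise embeddings are passed in as arbitrary Python callables, mirroring the modeling assumption above; the function names and the toy usage are ours, not the paper's.

```python
import numpy as np

def softmax_rows(A):
    # Row-wise softmax, as defined in Section 2.1 (shifted for numerical stability).
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def attention_head(X, Q, K, V, mask=None):
    # f_{Q,K,V}(X) = softmax(Q(X) K(X)^T [+ Lambda]) V(X), Definition 2.1.
    # Q, K, V are arbitrary row-wise maps applied to each token embedding.
    Qx = np.stack([Q(x) for x in X])
    Kx = np.stack([K(x) for x in X])
    Vx = np.stack([V(x) for x in X])
    scores = Qx @ Kx.T
    if mask is not None:               # Lambda in {-inf, 0}^{N x N}
        scores = scores + mask
    return softmax_rows(scores) @ Vx

def transformer_layer(X, heads):
    # X^l = X^{l-1} + sum_h f_{l,h}(X^{l-1}), Definition 2.2.
    return X + sum(attention_head(X, *h) for h in heads)

# Toy usage: one head with identity embeddings and a causal mask Gamma.
X = np.random.randn(6, 4)
causal = np.triu(np.full((6, 6), -np.inf), k=1)   # Gamma_{i,j} = 0 iff j <= i
out = transformer_layer(X, [(lambda x: x, lambda x: x, lambda x: x, causal)])
print(out.shape)  # (6, 4)
```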
### 2.2 Massively Parallel Computation Model

We use the definition of MPC from Andoni et al. (2018).

**Definition 2.3** (MPC protocol). For any global and local memory constants $\gamma, \delta > 0$, a $(\gamma, \delta)$-MPC protocol for a function $f : \mathbb{Z}_{2^p}^{n_{\mathrm{in}}} \to \mathbb{Z}_{2^p}^{n_{\mathrm{out}}}$ specifies a distributed computing protocol for $q = \Theta(n_{\mathrm{in}}^{1+\gamma-\delta})$ machines, each with $s = O(n_{\mathrm{in}}^{\delta})$ words of local memory, to jointly compute $f(\mathsf{Input})$ for any given $\mathsf{Input} \in \mathbb{Z}_{2^p}^{n_{\mathrm{in}}}$ as follows. (We assume the word size is $p = \Theta(\log n_{\mathrm{in}})$ bits; for convenience, we regard words as elements of $\mathbb{Z}_{2^p}$, the integers mod $2^p$.) The $\mathsf{Input} \in \mathbb{Z}_{2^p}^{n_{\mathrm{in}}}$ is distributed across the local memories of the first $n_{\mathrm{in}}/s$ machines. Computation proceeds in rounds. In each round, each machine computes an arbitrary function of its local memory to prepare at most $s$ words to send to other machines; messages are simultaneously transmitted, and the protocol ensures that each machine receives at most $s$ words at the end of the round. After the final round, the $\mathsf{Output} = f(\mathsf{Input}) \in \mathbb{Z}_{2^p}^{n_{\mathrm{out}}}$ is in the local memories of the first $n_{\mathrm{out}}/s$ machines. See Figure 1 for details.

[Figure 1. Formal execution of an MPC protocol for computing $f : \mathbb{Z}_{2^p}^{n_{\mathrm{in}}} \to \mathbb{Z}_{2^p}^{n_{\mathrm{out}}}$: the input words are initially distributed across the machines' local memories; in each round $r = 1, \ldots, R$, each machine $i \in [q]$ applies an arbitrary local function $\mathsf{Local}_{r,i}$ to its memory $\mathsf{MachineIn}^{(r)}_i$ to produce addressed messages $\mathsf{MachineOut}^{(r)}_i = \{(\mathsf{MsgOut}^{(r)}_{i,j}, \mathsf{Dest}^{(r)}_{i,j})\}_j$ of total size at most $s$ words; all messages are transmitted simultaneously, each machine also receiving at most $s$ words, yielding $\mathsf{MachineIn}^{(r+1)}_i = \{(\mathsf{Msg}, \mathsf{Src}) : (\mathsf{Msg}, i) \in \mathsf{MachineOut}^{(r)}_{\mathsf{Src}}\}$; after round $R$, the output resides in the first machines' local memories.]
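To make the round structure of Definition 2.3 concrete, here is a minimal Python simulator of a single MPC round under the stated send/receive limits. It is an illustrative sketch only (the function and variable names are ours, not the paper's); the local functions are arbitrary, exactly as in the definition.

```python
def mpc_round(memories, local_fns, s):
    """Simulate one round of an MPC protocol (Definition 2.3).

    memories[i]  : list of words currently held by machine i
    local_fns[i] : arbitrary function mapping machine i's memory to a list of
                   (message, destination) pairs, where a message is a tuple of words
    s            : local memory bound (at most s words sent and received per machine)
    """
    q = len(memories)
    outboxes = [local_fns[i](memories[i]) for i in range(q)]
    for i, out in enumerate(outboxes):                       # send-side limit
        assert sum(len(msg) for msg, _ in out) <= s, f"machine {i} sends too much"
    inboxes = [[] for _ in range(q)]
    for src, out in enumerate(outboxes):                     # simultaneous delivery
        for msg, dest in out:
            inboxes[dest].append((msg, src))                 # receiver learns the source
    for i, inbox in enumerate(inboxes):                      # receive-side limit
        assert sum(len(msg) for msg, _ in inbox) <= s, f"machine {i} receives too much"
    return inboxes

# Toy usage: every machine forwards its first word to machine 0.
mems = [[10 * i] for i in range(4)]
fns = [lambda mem: [((mem[0],), 0)]] * 4
print(mpc_round(mems, fns, s=4))
```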
Our negative results in Section 3.2 are conditional on the well-known one-versus-two cycle conjecture (Beame et al., 2017; Roughgarden et al., 2018; Ghaffari et al., 2019).

**Conjecture 2.4** (see, e.g., Ghaffari et al., 2019). For any $\gamma > 0$, $\delta < 1$, and $N$, if $\pi$ is a $(\gamma, \delta)$-MPC protocol that distinguishes a single cycle on $N$ nodes from a union of two cycles each on $N/2$ nodes, then $\pi$ uses $\Omega(\log N)$ rounds.

### 2.3 Graphs as Sequential Inputs

When providing a graph $G = (V, E)$ as input to transformers or MPC protocols, we serialize $G$ as a sequence in $[|V|]^{2|E|}$ that encodes each edge as a pair of vertex tokens. The resulting transformer has $N = 2|E|$ and $d_{\mathrm{in}} = 1$, and the resulting MPC protocol has $n_{\mathrm{in}} = 2|E|$.

## 3 Relating Transformers and MPC

We coarsely characterize the computational power of transformers in a certain size regime by establishing a bidirectional relationship between transformers and MPC. Theorems 3.1 and 3.4 show that any MPC protocol can be simulated by a transformer, and vice versa. As corollaries (Corollaries 3.3 and 3.5), we obtain tight upper and lower bounds on the depth of bounded-size transformers for computing connected components in graphs.

### 3.1 Simulation of MPC Protocols by Transformers

The following theorem shows that any MPC protocol $\pi$ with sublinear local memory can be simulated by a transformer whose depth $L$ is linear in the number of rounds $R$ of $\pi$, and whose embedding dimension $m$ is polynomial in the local memory size $s = O(N^\delta)$ of the machines used by $\pi$.

**Theorem 3.1.** For constants $0 < \gamma < \delta < 1$ and any deterministic $R$-round $(\gamma, \delta)$-MPC protocol $\pi$ on $n_{\mathrm{in}}$ input words and $n_{\mathrm{out}} \le n_{\mathrm{in}}$ output words, there exists a transformer $T \in \mathsf{Transformer}^N_{m,L,H}$ with $N = n_{\mathrm{in}}$, $m = O(n_{\mathrm{in}}^{4\delta} \log n_{\mathrm{in}})$, $L = R + 1$, and $H = O(\log\log n_{\mathrm{in}})$ such that $T(\mathsf{Input})_{:n_{\mathrm{out}}} = \pi(\mathsf{Input})$ for all $\mathsf{Input} \in \mathbb{Z}_{2^p}^{N}$.

The theorem provides a non-trivial construction in the strongly sub-linear local memory regime, when $s = O(N^{1/4 - \epsilon})$ for any $\epsilon > 0$. (Applying Theorem 3.1 when $\delta \ge \tfrac{1}{4}$ yields transformers with embedding dimension $m \ge N$, which trivializes the transformer architecture and negates any advantages of depth under our MLP universality assumption: a transformer with $N$-dimensional embeddings could aggregate the entire input sequence $X \in \mathbb{R}^N$ in a single embedding and use its output MLP to compute any arbitrary function of that input.) Because numerous tasks, including approximate maximum matching, submodular maximization, and weighted interval selection, can be solved by MPC protocols with $O(N^\delta)$ memory for any fixed $\delta \in (0, 1)$ (Ghaffari, 2019), these tasks are similarly implementable by transformers with sub-linear embedding dimension. Subsequent work by Sanford et al. (2024) improves this analysis by proving that any MPC protocol with local memory $s = O(N^{1-\epsilon})$ for any $\epsilon \in (0, 1)$ can be simulated by a transformer of embedding dimension $m = O(N^{1-\epsilon'})$ for some $\epsilon' \in (\epsilon, 1)$.

**Theorem 3.1 proof overview.** At a high level, the proof in Appendix B.2 entails simulating each round of parallel computation with a single-layer transformer and applying those constructions serially to $\mathsf{Input}$. The local computation on each machine (represented by $\mathsf{MachineOut}^{(r)}_i = \mathsf{Local}_{r,i}(\mathsf{MachineIn}^{(r)}_i)$) is directly encoded using element-wise query/key/value embeddings. The crux of the proof involves the simulation of a routing protocol to determine $\mathsf{MachineIn}^{(r+1)}$ from $\mathsf{MachineOut}^{(r)}$. We construct a self-attention unit that ensures that an encoding of a sequence of addressed messages from each machine is properly routed to its destinations. (This routing between machines uses the all-pairs structure of self-attention and may not admit a subquadratic approximation.)

For any message size $\beta$, message count bound $s$, and number of tokens $N$, we say that $(\mathsf{Sent}, \mathsf{Rcvd}) \in \mathbb{R}^{N \times m} \times \mathbb{R}^{N \times m}$ is a valid $(\beta, s)$-routing if, for each $i \in [N]$, the $i$-th row of $\mathsf{Sent}$ (resp. $\mathsf{Rcvd}$) is the vector encoding of some $\mathsf{Sent}_i \subseteq \mathbb{Z}_{2^p}^{\beta} \times [N]$ (resp. $\mathsf{Rcvd}_i \subseteq \mathbb{Z}_{2^p}^{\beta} \times [N]$) such that $\mathsf{Rcvd}_i = \{(\mathsf{Msg}, \mathsf{Src}) : (\mathsf{Msg}, i) \in \mathsf{Sent}_{\mathsf{Src}}\}$, and each of $\mathsf{Rcvd}_i$ and $\mathsf{Sent}_i$ has cardinality at most $s$. (We abuse notation by writing $\mathsf{Dest} \in \mathsf{Sent}_i$ to mean that there exists some $\mathsf{Msg}$ such that $(\mathsf{Msg}, \mathsf{Dest}) \in \mathsf{Sent}_i$.)

**Lemma 3.2.** For any $\beta, s, N \in \mathbb{N}$, there exists a transformer $\mathsf{route}_{\beta,s} \in \mathsf{Transformer}^N_{m,1,1}$ with $m = O(s^4 \beta \log N)$ satisfying $\mathsf{route}_{\beta,s}(\mathsf{Sent}) = \mathsf{Rcvd}$ for any valid $(\beta, s)$-routing $(\mathsf{Sent}, \mathsf{Rcvd})$.

The proof of Lemma 3.2 appears in Appendix B.1 and combines two key techniques: sparse propagation and multiple hashing. The former is a simple variant of the sparse averaging task of Sanford et al. (2023), which simultaneously computes $N$ averages over subsets of the inputs; this task is solved by a single self-attention head with small embedding dimension (Proposition B.1). Using sparse propagation, we construct a self-attention head that averages the $s$ encodings of each $\mathsf{Rcvd}_{\mathsf{Src}}$ for every $\mathsf{Src} \in \mathsf{Rcvd}_i$. In order to ensure that we can decode that average of encodings, we apply error correction by encoding each $\mathsf{Output}_i$ in a sparse and redundant manner, where each outgoing message appears as multiple copies of the same addressed packet.
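As a plain-Python reference for what $\mathsf{route}_{\beta,s}$ must compute, the following sketch spells out the routing relation and the $(\beta, s)$-validity constraints. It is only the specification of Lemma 3.2, not the attention-based construction from Appendix B.1, and the names are ours.

```python
def route_reference(sent, s):
    """Ground-truth routing relation realized by route_{beta,s} (Lemma 3.2).

    sent[i] : set of (msg, dest) pairs sent by token i, with msg a tuple of words.
    Returns rcvd, where rcvd[i] = {(msg, src) : (msg, i) in sent[src]}.
    Raises if the pair (sent, rcvd) violates the (beta, s)-validity cardinality bounds.
    """
    n = len(sent)
    rcvd = [set() for _ in range(n)]
    for src in range(n):
        assert len(sent[src]) <= s, "a token sends more than s messages"
        for msg, dest in sent[src]:
            rcvd[dest].add((msg, src))
    for i in range(n):
        assert len(rcvd[i]) <= s, "a token receives more than s messages"
    return rcvd

# Toy usage: tokens 1 and 3 each send one message to token 2.
sent = [set(), {((7,), 2)}, set(), {((9,), 2)}]
print(route_reference(sent, s=2)[2])   # {((7,), 1), ((9,), 3)}
```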
**Application: Connectivity with Log-Depth Transformers.** As an immediate consequence of Theorem 3.1, any graph problem solvable with a logarithmic number of rounds of MPC computation (and local memory $s$) is also computable by a logarithmic-depth transformer (with embedding dimension $O(s^4)$). The following result, which bounds the transformer depth needed to compute the connected components of a graph $G$, follows from Theorem 3.1 and Theorem 6.2 of Coy & Czumaj (2022), which derandomizes an MPC algorithm of Behnezhad et al. (2019).

**Corollary 3.3.** For any constant $\epsilon \in (0, 1)$ and any $D \in \mathbb{N}$, there exists a transformer in $\mathsf{Transformer}^N_{m,L,H}$ with $m = O(N^\epsilon)$, $H = O(\log\log N)$, and $L = O(\log D)$ that identifies the connected components of any input graph $G = (V, E)$ with $|V|, |E| = O(N)$ where each connected component has diameter at most $D$.

Coy & Czumaj also give efficient MPC algorithms for other related problems (e.g., spanning forest), so we obtain efficient transformers for these problems, too (Appendix B.3).

### 3.2 Simulation of Transformers by MPC Protocols

The following theorem shows that MPC protocols can simulate transformers, which lets us prove depth lower bounds on transformers conditioned on Conjecture 2.4. We get, as a corollary, the conditional optimality of the transformer depth bound in Corollary 3.3.

**Theorem 3.4.** For any transformer $T \in \mathsf{Transformer}^N_{m,L,H}$ (or $\Lambda\text{-}\mathsf{Transformer}^N_{m,L,H}$) with $mH = O(N^\delta)$ for $\delta \in (0, 1)$ and any $\delta' \in (\delta, 1)$, there exists an $O\!\big(\tfrac{L}{\delta' - \delta}\big)$-round $(1 + \delta', \delta')$-MPC protocol with $q = O(N^2)$ machines and $s = O(N^{\delta'})$ local memory for computing $T$.

Theorem 3.4 demonstrates that the algorithmic capabilities of transformers are no stronger than those of MPC protocols with a quadratic scaling in the number of machines. While Theorems 3.1 and 3.4 do not jointly provide a sharp characterization of the two computational models, the reductions are tight enough to provide strong evidence for the optimality of the connected-components construction of Corollary 3.3.

**Theorem 3.4 proof overview.** At a high level, the proof in Appendix C.1 constructs an MPC protocol that simulates a self-attention layer by separating the computation of MLPs and attention matrices into three categories of machines. Each input token is provided to its own token machine, responsible for preparing the query/key/value embeddings. Each pair of tokens is associated with an inner product machine that computes the inner product between their respective query and key embeddings. Propagation machines ensure that embeddings are routed to the proper inner product machines and compute the outputs of each softmax unit. The proof gives the communication protocol for these machines, shows how they simulate a layer of self-attention in $O(1/(\delta' - \delta))$ rounds, and establishes the sufficiency of $O(N^2)$ machines with $O(N^{\delta'})$ local memory.

**Application: Conditional Optimality of Corollary 3.3.** Assuming the well-known Conjecture 2.4, we prove an $\Omega(\log D)$ lower bound on the depth of parameter-efficient transformers for determining the connectivity of graphs whose connected components may have diameter up to $D$.

**Corollary 3.5.** Let $\epsilon \in (0, 1)$ be any constant, and let $D \le N^\epsilon$. Assume Conjecture 2.4, and suppose there exists $T \in \mathsf{Transformer}^N_{m,L,H}$ with $mH = O(D^{1-\epsilon})$ that decides connectivity of any input graph with connected components having diameter at most $D$. Then $L = \Omega(\log D)$.
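The machine decomposition from the Theorem 3.4 proof overview can be mimicked serially: every query–key inner product is an isolated unit of work, and each output row then only needs the resulting scores together with the value vectors. The sketch below is our own illustration of that two-stage structure, not the MPC protocol of Appendix C.1.

```python
import math

def attention_via_pairs(Q, K, V):
    """Recompute one attention layer in the spirit of the Theorem 3.4 simulation:
    stage 1 computes every (i, j) score independently (the role of the "inner
    product machines"), and stage 2 lets each row i aggregate its softmax average
    (the role of its "token machine"). A serial sketch of the decomposition only."""
    n, m = len(Q), len(Q[0])
    scores = {(i, j): sum(Q[i][t] * K[j][t] for t in range(m))
              for i in range(n) for j in range(n)}
    out = []
    for i in range(n):
        w = [math.exp(scores[i, j]) for j in range(n)]
        z = sum(w)
        out.append([sum(w[j] * V[j][t] for j in range(n)) / z for t in range(m)])
    return out

# Toy usage: identical rows, so every output row is the common value vector.
print(attention_via_pairs([[1.0, 0.0]] * 3, [[0.5, 0.5]] * 3, [[1.0, 2.0]] * 3))
```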
## 4 Transformers for k-Hop Induction Heads

We complement the generality of Section 3 by studying, both empirically and theoretically, a specific toy sequential modeling task, which will also serve (in Section 5) as a problem that separates the representational capabilities of transformers from those of other neural architectures. This task, called the k-hop induction heads task, draws inspiration from the original induction heads task defined and analyzed on trained language models and in synthetic environments by Elhage et al. (2021) (see also Bietti et al., 2023). The standard induction heads task completes bigrams auto-regressively by predicting the token that follows the last previous occurrence of the final token in the sequence. For example, given the input $X = \texttt{baebcabebdea}$, the standard induction heads task is to complete the final bigram by predicting $\texttt{b}$ for the final token. The $k$-hop induction heads task generalizes this mechanism by repeatedly using the completion of a bigram to determine the next bigram to complete. In the previous example, the 2-hop induction heads task is to predict $\texttt{c}$ for the final token of $\texttt{baebcabebdea}$: the last previous occurrence of $\texttt{a}$ is followed by $\texttt{b}$, and the last occurrence of $\texttt{b}$ before that position is followed by $\texttt{c}$.

**Definition 4.1.** For any finite alphabet $\Sigma$, define the map $\mathsf{hop}_k : \Sigma^N \to (\Sigma \cup \{\bot\})^N$ by $\mathsf{hop}_k(X)_i = X_{\mathsf{find}^k_X(i)}$ if $\mathsf{find}^k_X(i) \neq 0$ and $\bot$ otherwise, where $\mathsf{find}^1_X(i) = \max(\{0\} \cup \{j \in \mathbb{N} : j \le i,\ X_{j-1} = X_i\})$ and $\mathsf{find}^k_X(i) = \mathsf{find}^1_X(\mathsf{find}^{k-1}_X(i))$ for $k \ge 2$. The $k$-hop induction heads task is to compute, for each $i = 1, \ldots, N$, the value of $\mathsf{hop}_k(X)_i$ from $(X_1, \ldots, X_i)$.

We note a similarity to the LEGO tasks of Zhang et al. (2023), who empirically study the ability of transformers to learn sequential operations on Abelian groups and observe the ability to perform more operations than the depth of the network.

### 4.1 Log-Depth Transformer for k-Hop Induction Heads

Although $\mathsf{hop}_k$ appears to require $k$ sequential steps to solve, we show that it is solved by a transformer of depth $O(\log k)$.

**Theorem 4.2.** For any $k \in \mathbb{N}$ and alphabet $\Sigma$ with $|\Sigma| \le N$, there exists $T \in \mathsf{MaskTransformer}^N_{m,L,H}$ that computes $\mathsf{hop}_k : \Sigma^N \to (\Sigma \cup \{\bot\})^N$ with $m = O(1)$, $L = \log_2 k + 2$, and $H = 1$.

In contrast to Corollary 3.3, this construction has constant embedding dimension and is achieved by a causally-masked transformer. As such, its proof in Appendix D.1 depends on other techniques that exploit the simplicity of the problem and build on the induction heads construction of Bietti et al. (2023), rather than simply applying Theorem 3.1. We give evidence for the optimality of this construction by proving a conditional lower bound using Theorem 3.4, as was done in Corollary 3.5.

**Corollary 4.3.** Assuming Conjecture 2.4, for any constants $\xi \in (0, 1/2]$ and $\epsilon \in (0, 1)$, and any even $k = \Theta(N^\xi)$, every transformer $T \in \mathsf{MaskTransformer}^N_{m,L,H}$ with $mH = O(k^{1-\epsilon})$ that computes $\mathsf{hop}_k$ has depth $L = \Omega(\log k)$.
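A direct serial reference implementation of Definition 4.1 is short, and it is useful as ground truth for the experiments in the next subsection. This is just the task specification, not the transformer construction of Theorem 4.2; we use Python's `None` for the blank symbol.

```python
def find1(X, i):
    # find^1_X(i) = max({0} U {j <= i : X_{j-1} = X_i}), with 1-indexed positions.
    return max([0] + [j for j in range(2, i + 1) if X[j - 2] == X[i - 1]])

def findk(X, i, k):
    # find^k_X(i) = find^1_X(find^{k-1}_X(i)); 0 is an absorbing "not found" value.
    for _ in range(k):
        if i == 0:
            return 0
        i = find1(X, i)
    return i

def hopk(X, k):
    # hop_k(X)_i = X_{find^k_X(i)} if find^k_X(i) != 0, else the blank symbol.
    out = []
    for i in range(1, len(X) + 1):
        j = findk(X, i, k)
        out.append(X[j - 1] if j != 0 else None)
    return out

X = "baebcabebdea"
print(hopk(X, 1)[-1])  # b  (standard induction heads completion)
print(hopk(X, 2)[-1])  # c  (the 2-hop example from the text)
```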
### 4.2 Log-Depth Transformer Learned from Data

We empirically assess whether the representational trade-offs elucidated by tasks efficiently solved by parallelizable algorithms have implications for the optimization and generalization properties of transformers. To that end, we trained auto-regressive transformer architectures of varying sizes to solve $\mathsf{hop}_k(X)$ for a variety of values of $k$, in order to understand how changing the depth impacted the performance of the learned models, the goal being to verify the sufficiency of logarithmic depth, just as in our theory.

In brief, we trained transformers with 500K to 5M parameters and depths $L \in \{2, 3, 4, 5, 6\}$ with Adam to solve $\mathsf{hop}_k(X)$ for $k \in \{0, \ldots, 16\}$, with context length $N = 100$ and alphabet size $|\Sigma| = 4$. We trained the transformers in a multi-task setting, where a single model was trained to predict the sequence $\mathsf{hop}_k(X)$ auto-regressively when provided with $X$ and $k$ drawn at random. Further experimental details can be found in Appendix G.1, and the experimental code is available at https://github.com/chsanford/hop-induction-heads.

We found that transformers are indeed capable of learning $\mathsf{hop}_k$ given sufficient training time, and that the largest learnable $k$ grows exponentially with the depth. As can be seen in Figure 2, a six-layer network performs well on all $k \le 16$, a five-layer network on $k \le 8$, a four-layer network on $k \le 4$, and so forth. We further explore these experimental results in Appendix G.2 and observe that the performance threshold appears to lie specifically at depth $\log_2 k + 2$, coinciding with Theorem 4.2. This logarithmic dependence of the depth on $k$ persists in a larger-width regime, which is explored in Appendix G.3. In the finite-sample regime where neural networks are prone to overfit, our investigations in Appendix G.5 note improved generalization in deeper models, which suggests that deeper models have a favorable inductive bias for tasks like $\mathsf{hop}_k$.

[Figure 2. Token-wise classification error on $\mathsf{hop}_k$, for $k \in \{1, 2, 4, 8, 16\}$, of transformers of depths $L \in \{2, 3, 4, 5, 6\}$ trained on a mixture of $\mathsf{hop}_k$ for $k \in \{0, \ldots, 16\}$ and evaluated on $n = 100$ samples of size $N = 100$ from each $\mathsf{hop}_k$. Incrementing the depth approximately doubles the largest $k$ such that $\mathsf{hop}_k$ is learnable with small error.]

Moreover, the learned models are surprisingly interpretable. We examined the activation patterns of attention matrices and found close correspondences to useful intermediate products such as $\mathsf{find}^j_X$. Taken together, these indicate that the learned models mechanistically resemble the construction employed in the proof of Theorem 4.2. See Appendix G.4 for our investigation of model interpretability.

## 5 Separations between Transformers and Alternative Architectures

Sections 3 and 4 characterize the representational capability of transformers by providing algorithmic problems they can solve with logarithmic depth and small polynomial or constant width. In contrast, other well-known architectures are unable to solve those same problems in a parameter-efficient manner. This section provides lower bounds on the parameter complexity needed by graph neural networks (GNNs), recurrent neural architectures, transformers with computationally-efficient alternatives to softmax self-attention, and single-layer transformers with autoregressive chain-of-thought tokens to solve graph connectivity and the $k$-hop task.

### 5.1 GNNs Need Polynomial Depth for Graph Connectivity

The bidirectional relationship between transformers and MPC draws inspiration from past work drawing a similar connection between message-passing graph neural networks ($\mathsf{GNN}_{\mathrm{mp}}$) and the CONGEST distributed computing model (Loukas, 2019). That computation model of $\mathsf{GNN}_{\mathrm{mp}}$ with width $m$ and depth $L$ closely resembles our $\mathsf{Transformer}^N_{m,L,H}$ in providing a general framework for the analysis of graph neural networks, allowing unbounded computation at each vertex with bounded communication along edges. On an input graph $G$, vertices send their neighbors messages of size at most $m$, which are aggregated and crafted into new messages with MLPs over $L$ rounds of communication.
By restating Corollary 4.2 of Loukas (2019), we demonstrate a sharp contrast in the abilities of GNNs and transformers to solve graph algorithmic tasks.

**Theorem 5.1** (Corollary 4.2 of Loukas, 2019). There exists a graph $G$ with $N$ edges such that any $\mathsf{GNN}_{\mathrm{mp}}$ with width $m$ and depth $L$ that determines whether an input subgraph $H$ either (1) is connected or (2) forms a spanning tree of $G$ requires $Lm = \Omega(N^{1/4})$.

While Corollaries 3.3 and B.8 demonstrate the ability of transformers to determine whether any input graph is connected, or to identify a spanning tree, with logarithmic depth and small polynomial width (i.e., $m = O(N^\epsilon)$), GNNs require depth $L = \Omega(N^{1/4 - \epsilon/2})$ in the same regime. (While the problem of subgraph connectivity for GNNs may at first glance appear more difficult than general graph connectivity for transformers, this exact task can be solved by modifying the protocol of Corollary 3.3 to remove from the graph all edges that do not belong to $H$.) This gap is explained by the fact that transformers on graph inputs $G$ are not bound to pass messages exclusively along the edges of $G$. By rewiring the graphical structure in each layer, transformers can perform aggregation and pointer-passing tasks with greater parametric ease than GNNs.

### 5.2 Suboptimality of Recurrent Architectures for $\mathsf{hop}_k$

The logarithmic-depth and constant-width transformer implementation of $\mathsf{hop}_k$ in Theorem 4.2 cannot be replicated by recurrent neural architectures (Chung et al., 2014; Bengio et al., 1994; Turkoglu et al., 2021), including not just multi-layer recurrent neural networks (RNNs) but any sequential prediction procedure equivalent to them at inference time, which includes state-space models such as Mamba (Gu & Dao, 2023).

We first consider a family of multi-layer RNNs of depth $L$ and width $m$, consisting of arbitrary MLP units $g_\ell : \mathbb{R}^{m} \times \mathbb{R}^{m} \to \mathbb{R}^{m} \times \mathbb{R}^{m}$, which on input $X \in \mathbb{R}^{N \times d_{\mathrm{in}}}$ produce output $Y \in \mathbb{R}^{N \times d_{\mathrm{out}}}$ as follows, using intermediates $X = Z^0, Z^1, \ldots, Z^{L-1}, Z^L = Y \in \mathbb{R}^{N \times m}$ and hidden states $H^1, \ldots, H^L \in \{0,1\}^{N \times m}$ with $H^\ell_0 = 0$:
$(Z^\ell_i, H^\ell_i) = g_\ell(Z^{\ell-1}_i, H^\ell_{i-1})$ for $i \in [N]$ and $\ell \in [L]$.
(We assume that $d_{\mathrm{in}}, d_{\mathrm{out}} \le m$ and treat $X$ and $Y$ as if they are padded with zeros.) We provide a polynomial bound on the width and depth of a multi-layer RNN solving $\mathsf{hop}_k$.

**Corollary 5.2.** A multi-layer RNN of depth $L$ and width $m$ as above with $Y_N = \mathsf{hop}_k(X)_N$ satisfies either $L \ge k$ or $m = \Omega(N\cdots)$.

In contrast to Theorem 4.2, which demonstrates that depth-$O(\log k)$ transformers with constant width suffice to solve $\mathsf{hop}_k$ for any $k$, Corollary 5.2 demonstrates that all multi-layer RNNs with width $O(N^{1/7})$ require depth at least $k$ when $k = O(N^{1/7})$.
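The serial structure of this model is easy to spell out. The sketch below implements the abstract multi-layer RNN above with arbitrary per-token cell functions; the names and the toy cell are ours and are only for illustration, not any particular RNN or state-space architecture.

```python
def multilayer_rnn(X, cells, m):
    """Run the multi-layer RNN of Section 5.2: layer l scans the sequence left to
    right with (Z^l_i, H^l_i) = g_l(Z^{l-1}_i, H^l_{i-1}) and H^l_0 = 0.
    `cells[l]` is an arbitrary function (z, h) -> (z', h') on width-m vectors."""
    Z = X
    for g in cells:
        h = [0.0] * m                   # H^l_0 = 0
        new_Z = []
        for z in Z:                     # serial pass over the N tokens
            z, h = g(z, h)
            new_Z.append(z)
        Z = new_Z
    return Z                            # Y = Z^L

# Toy usage: a single "cell" that keeps a running sum in its hidden state.
def running_sum(z, h):
    h = [h[0] + z[0]]
    return [h[0]], h

print(multilayer_rnn([[1.0], [2.0], [3.0]], [running_sum], m=1))  # [[1.0], [3.0], [6.0]]
```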
Mamba (Gu & Dao, 2023) can be seen as the combination of three ideas: (1) a continuous-time dynamics model of sequential prediction, powerful enough to model Kalman filters, hidden Markov models, and many others; (2) a family of time-discretization schemes; and (3) an unrolling technique that enables efficient linear-time training, using ideas similar to FlashAttention (Dao et al., 2022). Ultimately, at inference time, the time-discretization step results in an RNN (see Gu & Dao, 2023, Algorithm 2 and Theorem 1), and is therefore directly handled by Corollary 5.2.

This corollary is a near-immediate application of a communication complexity fact about the hardness of solving multi-player pointer-chasing problems with limited communication among players (Guha & McGregor, 2009; Assadi & N, 2021). We provide the communication model and this result in Appendix E.1, and the reductions necessary to prove the above hardness results in Appendix E.2.

### 5.3 Suboptimality of Sub-Quadratic Attention Transformers for $\mathsf{hop}_k$

Due to the quadratic computational cost of computing the attention matrix $\mathrm{softmax}(Q(X)K(X)^{\mathsf{T}}) \in \mathbb{R}^{N \times N}$ and the continued desire for ever-larger context lengths, there is substantial interest in improving the computational complexity of the transformer architecture while preserving its expressive capabilities and inductive biases. As a result, a rich literature has emerged that proposes computationally-efficient alternatives to standard softmax attention. In this section, we demonstrate how several representative examples of sub-quadratic attention mechanisms lose the ability to perform efficient parallel computation under a logarithmic-depth scaling.

**Kernel-Based Sub-Quadratic Attention.** One approach to computationally-efficient approximation of transformers is kernel-based sub-quadratic attention, as in Performer (Choromanski et al., 2022) and PolySketchFormer (Kacham et al., 2023). Both approximate the attention matrix $\mathrm{softmax}(Q(X)K(X)^{\mathsf{T}})$ with a low-rank matrix $Q'(X)K'(X)^{\mathsf{T}}$, where $Q', K' : \mathbb{R}^m \to \mathbb{R}^{m'}$ are applied element-wise. For sufficiently small $m' \ll N$, $Q'(X)K'(X)^{\mathsf{T}}V(X)$ can be computed efficiently by first computing $K'(X)^{\mathsf{T}}V(X) \in \mathbb{R}^{m' \times m}$, bounding the total runtime by $O(Nmm')$ rather than $O(N^2 m)$. Let $\mathsf{KernelFormer}^N_{m,m',L,H}$ denote all $H$-headed, $L$-layer transformers whose softmax attention modules are replaced by kernel-based sub-quadratic attention. We demonstrate the limitations of $\mathsf{KernelFormer}^N_{m,m',L,H}$ by showing that, unlike $\mathsf{Transformer}^N_{m,L,H}$, they have no depth-efficient implementation of $\mathsf{hop}_k$.

**Corollary 5.3.** Any $T \in \mathsf{KernelFormer}^N_{m,m',L,H}$ with $T(X)_N = \mathsf{hop}_k(X)_N$ satisfies either $L \ge k$ or $mm'Hp = \Omega(N\cdots)$.

Under a parameter-efficient regime where $mpHL = O(N^\epsilon)$, solving $\mathsf{hop}_k$ for $k = \Theta(N^\epsilon)$ necessitates kernel feature dimension $m' = \Omega(N^{1-9\epsilon})$, which forces each attention unit to compute an $N \times N^{1-9\epsilon}$ matrix, yielding a nearly quadratic runtime. We prove Corollary 5.3 in Appendix E.3 using a similar pointer-chasing reduction.
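The associativity trick behind kernel-based attention is worth seeing in code: with element-wise feature maps $Q'$ and $K'$, computing $K'(X)^{\mathsf{T}}V(X)$ first reduces the cost from $O(N^2 m)$ to $O(Nmm')$. The sketch below is a generic linear-attention computation under that pattern, with a placeholder feature map rather than the specific Performer or PolySketchFormer kernels; the row normalization is included for concreteness.

```python
import numpy as np

def kernel_attention(X, Qp, Kp, V):
    """Kernel-based sub-quadratic attention as described in Section 5.3: the softmax
    attention matrix is replaced by Q'(X) K'(X)^T with element-wise feature maps
    Q', K' : R^m -> R^{m'}. Computing K'(X)^T V(X) first costs O(N m m')."""
    Qf = np.stack([Qp(x) for x in X])       # N x m'
    Kf = np.stack([Kp(x) for x in X])       # N x m'
    Vx = np.stack([V(x) for x in X])        # N x m
    KV = Kf.T @ Vx                          # m' x m  -- the key associativity step
    Z = Qf @ Kf.sum(axis=0)                 # per-row normalizers (positive features)
    return (Qf @ KV) / Z[:, None]

# Toy usage with a simple positive feature map (a placeholder, not FAVOR+ or a sketch).
phi = lambda x: np.exp(x - x.max())
X = np.random.randn(8, 4)
print(kernel_attention(X, phi, phi, lambda x: x).shape)   # (8, 4)
```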
**Masking-Based Sub-Quadratic Attention.** Another method that reduces the computational cost of transformers is to use masked models $\Lambda\text{-}\mathsf{Transformer}^N_{m,L,H}$ for a sparse mask $\Lambda$. The Longformer architecture (Beltagy et al., 2020) introduces a particular masked architecture that combines sliding windows with sparse unmasked global tokens. Put concretely, for window radius $w$ and global frequency $g$, let $\Lambda^{w,g} \in \{-\infty, 0\}^{N \times N}$ be the masking matrix with $\Lambda^{w,g}_{i,j} = 0$ if $|i - j| \le w$ or $j \equiv 0 \pmod{g}$, and $\Lambda^{w,g}_{i,j} = -\infty$ otherwise. Then, the output of a single unit of $\Lambda^{w,g}$-masked attention is computable in time $O((w + \tfrac{N}{g})Nm)$.

**Corollary 5.4.** Any $T \in \Lambda^{w,g}\text{-}\mathsf{Attn}^N_{m,L,H}$ with $T(X)_N = \mathsf{hop}_k(X)_N$ satisfies either $L \ge k$ or $\big(w + \tfrac{N}{gk}\big)mHp = \Omega(N\cdots)$.

Like kernel-based attention, sparsely-masked attention models fail to efficiently compute $\mathsf{hop}_k$. Similarly, in the same parameter-efficient regime, a Longformer must have either $w = \Omega(N^{1-9\epsilon})$ or $g = O(N^{9\epsilon})$, which jointly ensures that the masked matrix has at least $\Omega(N^{2-9\epsilon})$ entries and diminishes any computational advantages. This proof also appears in Appendix E.3.

### 5.4 Limitations of 1-Layer Transformers with Chain-of-Thought

While most of the paper considers transformers as sequence-to-sequence models, we can also frame them as autoregressive models performing next-token prediction with chain-of-thought prompting. In this regime, a single causally-masked transformer aims to compute a function of its input by repeatedly predicting the next token, appending previously predicted tokens to the end of the input. In doing so, a function is computable if there exists an intermediate chain of thought produced by the model that eventually reaches the answer.

**Definition 5.5.** We say that $T \in \mathsf{MaskTransformer}^{N+N_{\mathrm{CoT}}}_{m,L,H}$ computes $f : \Sigma^{N+N_{\mathrm{CoT}}} \to \Sigma^N$, where the additional $N_{\mathrm{CoT}}$ tokens denote the chain of thought, if for every $X \in \mathrm{dom}(f)$ there exists $X_{\mathrm{CoT}} \in \Sigma^{N_{\mathrm{CoT}}}$ such that $T(X \oplus X_{\mathrm{CoT}})_{N:N+N_{\mathrm{CoT}}} = (X_{\mathrm{CoT}} \oplus f(X))$.

The theoretical capabilities of chain-of-thought-augmented transformers to simulate finite-state automata and Turing machines have been studied (Malach, 2023; Merrill & Sabharwal, 2023b), but the comparative capabilities of shallow models with chain-of-thought prompting and deep sequential models are unknown. In contrast to the fact that any transformer with $N_{\mathrm{CoT}}$ tokens can be simulated by a sequential model with depth scaled by $N_{\mathrm{CoT}}$, we show that deep transformers cannot necessarily be efficiently simulated by shallow chain-of-thought models. We do so by demonstrating that an amount of chain-of-thought prompting linear in $k$ is necessary to solve $\mathsf{hop}_k(X)_N$, and also sufficient.

**Corollary 5.6.** Any $T \in \mathsf{MaskTransformer}^{N+N_{\mathrm{CoT}}}_{m,1,H}$ that computes $\mathsf{hop}_k(X)_N$ with $N_{\mathrm{CoT}}$ tokens of chain of thought requires either $N_{\mathrm{CoT}} \ge k$ or $mHp = \Omega(N\cdots)$.

The proof appears in Appendix E.4. For future work, it remains to consider the comparative powers of chain-of-thought models of depths greater than one.

## 6 Conclusion and Future Work

This work highlights parallelism as a central feature of transformers that sets them apart from other neural architectures. The focus on the log-depth and sublinear-width regime and on specific computational tasks allows us to accentuate the benefits of parallelism, even for tasks like $k$-hop that appear inherently serial at first glance. There is some efficiency loss in the compilation of MPC protocols to transformers, which subsequent work by Sanford et al. (2024) remedies by extending Theorem 3.1 to all MPC algorithms with strictly sublinear local memory. Although we have empirically demonstrated the learnability of transformers that exploit parallelism in crucial ways, a theoretical understanding of learning such solutions remains an open question.

Finally, a unified theoretical framework for transformers and parallel computation that addresses both task parallelism and hardware parallelism would be of great value. While the MPC model delivers substantial insight into the algorithmic capabilities of transformers, its local memory assumptions preclude a useful analysis of the inference and training runtime capabilities of modern GPU hardware. Future work in this direction may integrate the results of this paper with works that present approaches to model and data parallelism (e.g., Shoeybi et al., 2019).

## Acknowledgements

C.S. acknowledges the support of an NSF GRFP award and NSF grants IIS-1838154, IIS-2040971, CCF-2106429, and CCF-2211238. D.H. acknowledges the support of NSF grant IIS-2040971. The authors appreciate the literature recommendations and insights from conversations with Yuval Efron, Peilin Zhong, Alex Andoni, Rocco Servedio, and Vahab Mirrokni.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning.
There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Agarwal, A., Chapelle, O., Dudík, M., and Langford, J. A reliable effective terascale linear learning system. Journal of Machine Learning Research, 15(1):1111 1133, 2014. Andoni, A., Nikolov, A., Onak, K., and Yaroslavtsev, G. Parallel algorithms for geometric graph problems. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pp. 574 583, 2014. Andoni, A., Song, Z., Stein, C., Wang, Z., and Zhong, P. Parallel graph connectivity in log diameter rounds. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, October 2018. doi: 10. 1109/focs.2018.00070. URL http://dx.doi.org/ 10.1109/FOCS.2018.00070. Angluin, D., Chiang, D., and Yang, A. Masked hardattention transformers and boolean rasp recognize exactly the star-free languages, 2023. Assadi, S. and N, V. Graph streaming lower bounds for parameter estimation and property testing via a streaming xor lemma. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 21. ACM, June 2021. doi: 10.1145/3406325.3451110. URL http://dx.doi.org/10.1145/3406325. 3451110. Beame, P., Koutris, P., and Suciu, D. Communication steps for parallel query processing. Journal of the ACM (JACM), 64(6):1 58, 2017. Behnezhad, S., Brandt, S., Derakhshan, M., Fischer, M., Hajiaghayi, M., Karp, R. M., and Uitto, J. Massively parallel computation of matching and mis in sparse graphs. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, pp. 481 490, 2019. Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer, 2020. Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157 166, 1994. doi: 10.1109/72.279181. Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal Transformers, Parallel Computation, and Logarithmic Depth languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020. Bietti, A., Cabannes, V., Bouchacourt, D., Jegou, H., and Bottou, L. Birth of a transformer: A memory viewpoint, 2023. Charikar, M., Ma, W., and Tan, L.-Y. New lower bounds for massively parallel computation from query complexity, 2020. Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L., and Weller, A. Rethinking attention with performers, 2022. Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. ar Xiv preprint ar Xiv:1412.3555, 2014. Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. What does bert look at? an analysis of bert s attention. ar Xiv preprint ar Xiv:1906.04341, 2019. Coy, S. and Czumaj, A. Deterministic massively parallel connectivity. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2022, pp. 162 175, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392648. doi: 10.1145/3519935.3520055. URL https://doi. org/10.1145/3519935.3520055. Daniely, A. Depth separation for neural networks. In Kale, S. and Shamir, O. (eds.), Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pp. 690 696. PMLR, 07 10 Jul 2017. URL https://proceedings.mlr. 
press/v65/daniely17a.html. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Neur IPS, 2022. Dean, J. and Ghemawat, S. Mapreduce: Simplified data processing on large clusters. In OSDI, pp. 137 150, 2004. Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Llm.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems, volume 35, 2022. Duris, P., Galil, Z., and Schnitger, G. Lower bounds on communication complexity. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pp. 81 91, 1984. Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Feldman, V., Rakhlin, A., and Shamir, O. (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 907 940, Columbia University, New York, New York, USA, 23 26 Jun 2016. PMLR. URL https://proceedings.mlr.press/v49/ eldan16.html. Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Das Sarma, N., Drain, D., Ganguli, D., Hatfield Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., Mc Candlish, S., and Olah, C. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformercircuits.pub/2021/framework/index.html. Ghaffari, M. Massively parallel algorithms. URL: http://people. csail. mit. edu/ghaffari/MPA19/Notes/MPA. pdf, 2019. Ghaffari, M., Kuhn, F., and Uitto, J. Conditional hardness results for massively parallel computation from distributed lower bounds. In IEEE 60th Annual Symposium on Foundations of Computer Science, pp. 1650 1663, 11 2019. doi: 10.1109/FOCS.2019.00097. Goodrich, M. T., Sitchinava, N., and Zhang, Q. Sorting, searching, and simulation in the mapreduce framework. In International Symposium on Algorithms and Computation, pp. 374 383. Springer, 2011. Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces, 2023. Guha, S. and Mc Gregor, A. Stream order and order statistics: Quantile estimation in random-order streams. SIAM Journal on Computing, 38(5):2044 2059, 2009. doi: 10.1137/07069328X. URL https://doi.org/10. 1137/07069328X. Hahn, M. Theoretical limitations of self-attention in neural sequence models. Trans. Assoc. Comput. Linguistics, 8:156 171, 2020. doi: 10.1162/tacl\_{a}{\_ {0}{0}{3}}{0}6. URL https://doi.org/10. 1162/tacl_a_00306. Hao, Y., Angluin, D., and Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Trans. Assoc. Comput. Linguistics, 10:800 810, 2022. URL https://transacl.org/ ojs/index.php/tacl/article/view/3765. Im, S., Kumar, R., Lattanzi, S., Moseley, B., Vassilvitskii, S., et al. Massively parallel computation: Algorithms and applications. Foundations and Trends in Optimization, 5(4):340 417, 2023. Transformers, Parallel Computation, and Logarithmic Depth Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583 589, 2021. Kacham, P., Mirrokni, V., and Zhong, P. Polysketchformer: Fast transformers via sketches for polynomial kernels, 2023. Karloff, H., Suri, S., and Vassilvitskii, S. A model of computation for mapreduce. 
In Twenty-first Annual ACMSIAM Symposium on Discrete Algorithms, pp. 938 948, 12 2010. doi: 10.1137/1.9781611973075.76. Kim, J., Nguyen, T. D., Min, S., Cho, S., Lee, M., Lee, H., and Hong, S. Pure transformers are powerful graph learners, 2022. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2014. Li, Y. and Mc Clelland, J. L. Systematic generalization and emergent structures in transformers trained on structured tasks, 2022. Likhosherstov, V., Choromanski, K., and Weller, A. On the expressive power of self-attention matrices. ar Xiv preprint ar Xiv:2106.03764, 2021. Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata, 2022. Loukas, A. What graph neural networks cannot learn: depth vs width. ar Xiv preprint ar Xiv:1907.03199, 2019. Malach, E. Auto-regressive next-token predictors are universal learners, 2023. Merrill, W. and Sabharwal, A. A logic for expressing logprecision transformers, 2022. Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11: 531 545, 2023a. ISSN 2307-387X. doi: 10.1162/tacl_ a_00562. URL http://dx.doi.org/10.1162/ tacl_a_00562. Merrill, W. and Sabharwal, A. The expressive power of transformers with chain of thought, 2023b. Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10:843 856, 2022. ISSN 2307-387X. doi: 10.1162/tacl_a_00493. URL http://dx.doi.org/ 10.1162/tacl_a_00493. MPICH. Mpi allreduce, 2023. URL https: //www.mpich.org/static/docs/latest/ www3/MPI_Allreduce.html. Nanongkai, D. and Scquizzato, M. Equivalence classes and conditional hardness in massively parallel computations. Distributed Computing, 35(2):165 183, January 2022. ISSN 1432-0452. doi: 10.1007/ s00446-021-00418-2. URL http://dx.doi.org/ 10.1007/s00446-021-00418-2. Nisan, N. and Wigderson, A. Rounds in communication complexity revisited. SIAM Journal on Computing, 22(1): 211 219, 1993. doi: 10.1137/0222016. URL https: //doi.org/10.1137/0222016. Oren, M., Hassid, M., Adi, Y., and Schwartz, R. Transformers are multi-state rnns, 2024. Papadimitriou, C. H. and Sipser, M. Communication complexity. In Proceedings of the Fourteenth Annual ACM Symposium on Theory of Computing, pp. 196 200, 1982. Pérez, J., Barceló, P., and Marinkovic, J. Attention is turing complete. Journal of Machine Learning Research, 22(1): 3463 3497, 2021. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. Rogers, A., Kovaleva, O., and Rumshisky, A. A primer in bertology: What we know about how bert works. Transactions of the Association for Computational Linguistics, 8:842 866, 2021. Roughgarden, T., Vassilvitskii, S., and Wang, J. Shuffles and circuits (on lower bounds for modern parallel computation). Journal of the ACM, 65:1 24, 11 2018. doi: 10.1145/3232536. Sanford, C., Hsu, D., and Telgarsky, M. Representational strengths and limitations of transformers, 2023. Sanford, C., Fatemi, B., Hall, E., Tsitsulin, A., Kazemi, M., Halcrow, J., Perozzi, B., and Mirrokni, V. Understanding transformer reasoning capabilities via graph algorithms, 2024. Shoeybi, M., Patwary, M., Puri, R., Le Gresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multibillion parameter language models using model parallelism, 2019. Strobl, L. 
Average-hard attention transformers are constantdepth uniform threshold circuits, 2023. Transformers, Parallel Computation, and Logarithmic Depth Strobl, L., Merrill, W., Weiss, G., Chiang, D., and Angluin, D. Transformers as recognizers of formal languages: A survey on expressivity, 2023. Telgarsky, M. Benefits of depth in neural networks. In Feldman, V., Rakhlin, A., and Shamir, O. (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 1517 1539, Columbia University, New York, New York, USA, 23 26 Jun 2016. PMLR. URL https://proceedings. mlr.press/v49/telgarsky16.html. Turkoglu, M. O., D Aronco, S., Wegner, J. D., and Schindler, K. Gating revisited: Deep multi-layer rnns that can be trained. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4081 4092, 2021. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017. Wang, Z., Wang, C., Xu, X., Zhou, J., and Lu, J. Quantformer: Learning extremely low-precision vision transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. Wei, C., Chen, Y., and Ma, T. Statistically meaningful approximation: a case study on approximating turing machines with transformers, 2021. Yao, S., Peng, B., Papadimitriou, C. H., and Narasimhan, K. Self-attention networks can process bounded hierarchical languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021. Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020. Zhang, Y., Backurs, A., Bubeck, S., Eldan, R., Gunasekar, S., and Wagner, T. Unveiling transformers with lego: a synthetic reasoning task, 2023. Transformers, Parallel Computation, and Logarithmic Depth A Supplemental Preliminaries A.1 Further Details about Transformers We discuss a few minor technicalities and modifications of the self-attention unit (Definition 2.1) and transformer model (Definition 2.2) defined in Section 2.1 that are necessary for readers looking for a comprehensive understanding of the proofs of our theoretical results. Fixed-Bit Precision Arithmetic. As discussed in Section 2.1, we assume that all numbers that appear in the intermediate products and outputs of self-attentions are representable with p-bit precision arithmetic, where p = Θ(log N). While the details of fixed-precision arithmetic will be uninteresting to most readers, it is necessary to explain precisely what we mean in order to ensure that proofs of results like Theorem 3.4 are sound. Throughout the paper, we allow p to depend on of constants, such as γ, δ, and ϵ. Concretely, we assume that all query, key, and value embeddings Q(X), K(X), V (X) evaluated on all inputs contain scalar values z R that are polynomially bounded (i.e. |z| exp(O(p)) = N ζ for sufficiently large constant exponent ζ > 0) and are inverse-polynomially discretized (i.e. z N ζ Z). Depending on the desired exponent ζ, some p = Θ(log N) can be chosen to guarantee this property. While we do not formally analyze the precision needed to approximate the particular embeddings employed by our proofs, we note that our recurring sinusoidal embeddings (e.g. 
Lemma D.1) can be discretized without losing their central properties and that discretizations of the restricted isometry embeddings of Proposition B.1 are analyzed by Sanford et al. (2023). Rather than stipulating a particular bounded-precision implementation that computes the output of a self-attention unit must be implemented, we specify a rounding constraint that any computational implementation of a self-attention unit must satisfy. Precisely, we require that any output round to the same inverse-polynomial discretization as the true mathematical attention. Definition A.1. For a self-attention unit f Attn N m, let ˆf be an finite-precision implementation of that unit. We say that ˆf is a valid implementation if f(X) ˆf(X) = O 1 This definition is only to establishing the fact that self-attention units with sufficient margins can precisely compute hardmax outputs in Lemma A.2 and to showing that MPC models can indeed compute the outputs precisely in Theorem 3.4. Hardmax Attention. While we exclusively consider attention units with the softmax, our constructions periodically rely on the exact computation of averages of embeddings. We define the hardmax operator to allow the consideration of discrete averaging operations. For some v RN, let hardmax(X)i = ( 1 |Imax(v)|, if i Imax(v) 0 otherwise, where Imax(v) = {i [N] : vi = maxi vi }. We show that bounded-precision softmax self-attention units that satisfy a margin property can be modified slightly to have identical outputs to an analogous hardmax unit. Lemma A.2. Let f Attn N m be a self-attention unit with precision p = Θ(log N) and embedding functions Q, K, V such that for some fixed 1 ξ = N O(1) and every X RN m and i [N]: A(X)i,i max i A(X)i,i ξ, i Imax(A(X)i), where A(X) = Q(X)K(X)T. Then there exists a self-attention unit f Attn N m with a valid p -bit implementation with p = O(p) satisfying f (X) = hardmax(A(X))V (X). The proof of Lemma A.2 is provided in Appendix F. Transformers, Parallel Computation, and Logarithmic Depth Start Tokens. Our technical proofs are occasionally simplified by including a dummy token whose value is passed in self-attention layers as a default or null value. For example, in the proof of Lemma D.2, the dummy token handles the case where the reference token does not appear previously in the sequence. While we believe that this extra token is not necessary for our technical arguments, we include it for the sake of simplicity. We model this dummy token as a start-of-sequence token X0. Concretely, if we employ X0 in a self-attention f Attn N m which takes as input X, we instead treat f as an attention unit in Attn N+1 m that operates on (X0, X1, . . . , XN). We assume that X0 is constant-valued, and therefore never both to pay attention to its outputs; it s only relevance is via its key and value embeddings K0(X0), V0(X0) Rm. If X0 is unmentioned, we assume that it does not exist, or is set such that its key embedding inner products are all zero. Supplemental Chain-of-Thought Tokens. We periodically (see Theorem B.3 and the proofs of Corollaries 3.5 and 4.3) consider transformers with supplemental blank chain-of-thought tokens appended to the end of the sequence. Unlike the start token, these are only constant at initialization and may be used deeper in the model to perform meaningful computations. Let Transformer N,M m,L,H,din,dout denote transformers with M N extra blank elements appended to the input sequence. 
Concretely, we represent T Transformer N,M m,L,H,din,dout as some T Transformer M m,L,H,din,dout and define the output T(X) for X RN din by letting Y RM din for Y1:N = X and YN+1:M = 0, and letting T(X) = T (Y ). B Proofs from Section 3.1 B.1 Proof of Lemma 3.2 Lemma 3.2. For any β, s, N N, there exists a transformer routeβ,s Transformer N m,1,1 with m = O(s4β log N) satisfying routeβ,s(Sent) = Rcvd for any valid (β, s)-routing (Sent, Rcvd). The proof relies on a sparse propagation sequential primitive, which complements the sparse averaging primitive of Sanford et al. (2023). For any Q d, N, on input X = (X1, . . . , XN) RN d with Xi = (zi, Si) Rd Q [N]Q and bi = |{Sj i : j [N]}| Q, we define sparse Propagate Q,d(X)i = Sj i zj if bi > 0, 0 otherwise. Closely following the argument of Sanford et al. (2023), we show in Proposition B.1 that there is a self-attention unit with embedding dimension m = max(d, O(q log N)) that computes sparse Propagate Q,d. This construction is a key component of the single-layer transformer used in the proof of Lemma 3.2. Proposition B.1. For any b N and d, there exists a self-attention unit sparse Propagate Q,d Attn N m,p for m = d + O(Q log N) and p = O(log N), which, given any input X with Xi = (zi, Si, 0) Rd [N] Q {0}m Q d such that bi = |{Sj i : j [N]}| Q for all i, has output sparse Propagate Q,d(X) satisfying sparse Propagate Q,d(X)i = 1 The proof of Proposition B.1 appears in Appendix F. Proof of Lemma 3.2. We construct a single-layer single-headed transformer with query, key, and value embeddings Q, K, V and output MLP ψ. Q, K, V can be decomposed as Q = Q ϕ, K = K ϕ, V = V ϕ, for some input MLP ϕ and embeddings Q , K , V . We fix Q , K , V to be the respective embeddings of the self-attention unit with embedding dimension m from Proposition B.1 that computes Y = sparse Propagates,m(X) for XSrc = (z Src, SSrc) for every Src [N] to be determined. Hence, the proof entails designing element-wise encoders ϕ = (ϕ1, . . . , ϕN) and decoders ψ = (ψ1, . . . , ψN) that compute Rcvd from Sent, using sparse Propagates,m as an intermediate step. A high-level overview of the proof construction is visualized in Figure 3. Transformers, Parallel Computation, and Logarithmic Depth sparse Propagate 2Y2 = 2 sparse Propagate((z, S))2 2 α 2Src 2Dest Y1 = (0, 2) g Dest g Msg g Src α S1 = (0, 2) S3 = (2, 4) g Dest g Msg g Src α (z, S) Y Rcvd ψ Figure 3. A visualization of the construction used to prove Lemma 3.2 in three phases the encoding of each input Sent Src as embedding z Src and subset SSrc with ϕ; the combination of those embeddings into YDest via the simulation of sparse Propagates,m((z, S)); and the decoding of each YDest into output Rcvd Dest with ψ. The figure provides an example of the encoding and decoding where machines 1 and 3 transmit messages to machine 2. Multiple hashing is used to compute z1 and z3 by encoding each message in multiple fixed-location packets in embedding space space. This redundancy ensures the possibility of machine 2 decoding Rcvd2 from Y2, due to each message occurring alone at least once in the encoding. On input Sent Src, we use the encodings QSrc, KSrc, VSrc to specify that all tokens Dest with Dest Sent Src (or equivalently, all Dest with Src Rcvd Dest) should receive a copy of the encoding of Sent Src. That is, we set SSrc := {Dest Sent Src} for each Src [N]. This ensures that Y satisfies YDest = 1 |Rcvd Dest| Src Rcvd Dest z Src. 
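Before the packet encoding is specified, the following is a minimal NumPy sketch of the input-output behavior of the sparse propagation primitive itself (not the attention-unit implementation of Proposition B.1): each destination averages the embeddings of the sources whose destination sets contain it, which is exactly the averaged form of Y_Dest stated above. Indices are 0-based and the function name is ours.

```python
import numpy as np

def sparse_propagate(z, S):
    """Reference semantics of the sparse propagation primitive.

    z : (N, d) array of source embeddings z_j.
    S : list of N sets; S[j] holds the destinations that source j sends to.

    Returns Y with Y[i] equal to the average of z[j] over all sources j
    with i in S[j], or the zero vector if no source targets i.
    """
    N, d = z.shape
    Y = np.zeros((N, d))
    for i in range(N):
        senders = [j for j in range(N) if i in S[j]]
        if senders:               # b_i > 0
            Y[i] = z[senders].mean(axis=0)
    return Y

# Toy example mirroring Figure 3: sources 0 and 2 both target destination 1.
z = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]])
S = [{1}, set(), {1}]
print(sparse_propagate(z, S))     # row 1 is the average of rows 0 and 2
```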
While it s tempting to simply set each z Src Rm equal to a (βs)-dimensional vectorization of Sent Src, it is unclear how to extract Rcvd Dest from each YDest, since each average performed by sparse Propagates,m will combine multiple vector embeddings in a shared space. In order to avoid these troubles, we employ a multiple hasing-based encoding that treats messages as packets identified by a message, a source, a destination, and a validity token that can be used to determine whether a message is uncorrupted. We include multiple copies of each packet in the encoding z Src. For notational ease, we represent each z Src Rm as a collection of packets z Src = (g Msg Src,j, g Src Src,j, ] Dest Src,j, αSrc,j)j [m ] (Zβ 2p [N] [N] {0, 1})m , where m = m (3 + β). Transformers, Parallel Computation, and Logarithmic Depth To sparsely and redundantly encode each Sent Src as z Src, we encode outgoing messages as packets by utilizing the matrix A guaranteed by the following fact (which we use with n := N 2, b := s2, and m := d = O(s4 log N)). Fact B.2. For any n, b n, and d 12b2 ln n , there exists a binary matrix A {0, 1}n d such that, for every subset S [n] with |S| b, the columns of the sub-matrix AS {0, 1}|S| d contains all S-dimensional elementary vectors, i.e., e1, . . . , e|S| is a subset of the columns of AS. The proof of Fact B.2 is at the end of the section. We use the following rule to determine which (if any) message to encode as a packet at each Src [N] and j [m ]. We let A(Src,Dest),j = AN(Src 1)+Dest,j for notational convenience. (Msg, Src, Dest, 1) if (Msg, Dest) Sent Src and A(Src,Dest),j = 1 and A(Src,Dest ),j = 0, Dest Sent Src \ {Dest} , ( 0, 0, 0, 0) otherwise. In Figure 3, this encoding is visualized in the tables of Machine 1 and Machine 3, where the entirety of each message is encoded in two fixed and distinct locations in the embeddings z1 and z3, alongside metadata about the source of message and the validity α. Each message is encoded as multiple identical packets in different embedding dimensions and a large fraction of embedding locations are left blank. These features are critical for the proper evaluation of the decoding step ψ. We analyze the Y = sparse Propagateβ,m(X) outputs, letting YDest = (YDest,1, . . . , YDest,m ), YDest,j (Rβ R R R)m , with all numbers represented with p-bit fixed precision. This analysis shows that there exists an element-wise decoder MLP ψ satisfying ψDest(YDest) = Rcvd Dest for all Dest [N]. For any j [m ], observe from the definition of z Src and sparse Propagates,m that Msg Dest,j, Src Dest,j, Dest Dest,j, αDest,j = 1 |Rcvd Dest| Src Rcvd Dest g Msg Src,j, g Src Src,j, ] Dest Src,j, αSrc,j . Before formally analyzing this construction, we motivate its utility with Figure 3. The encoding 2Y2 of Machine 2 contains four clean rows j with 2 α2,j = 1, two corrupted rows with 2 α2,j = 2, and one blank row with 2 α2,j = 0. The blank row contains no information about any incoming messages, since neither Machine 1 nor Machine 3 encoded messages as packets in these locations. The fact that 2 α2,j = 0 certifies the blankness of this row, and hence, the decoder ψ can ignore it. The corrupted rows correspond to locations where both Machine 1 and Machine 3 saved messages as packets. As a result, the corresponding embedding Y2,j = 1 2(z1,j + z3,j) is an average of two non-zero embeddings and is hence corrupted. Because 2 α2,j = 2, the decoder ψ detects the corruption and ignores it when computing Rcvd2. 
The clean rows are locations where exactly one of Machine 1 and Machine 3 encoded a message. Hence, these messages can be cleanly understood by the decoder ψ, which simply validates the cleanliness of the row with 2 α2,j = 1, determines whether Machine 2 is indeed the target recipient of the respective message, and saves all such messages in the decoding Rcvd2. We prove the validity of this intuition by ensuring that the encoding scheme successfully encodes each incoming message in a clean row and that the category of each row (blank, corrupted, or clean) can be detected by the decoder ψ. We observe the following sequence of facts about every YDest. Let Relevant Dest := {(Msg, Src , Dest ) : Src Rcvd Dest, (Msg, Dest ) Sent Src } denote the set of all messages sent by sources of messages sent to Dest. Transformers, Parallel Computation, and Logarithmic Depth 1. Consider any outgoing message (Msg, Src , Dest ) Relevant Dest. By the property of A guaranteed by Fact B.2, there exists some j such that A(Src ,Dest ),j = 1 and A(Src ,Dest ),j = 0 for every (Src , Dest ) Relevant Dest \ {(Src , Dest )} . As a result of the definition of the encoding z and the averaged representation of YDest: YDest,j = 1 |Rcvd Dest| (Msg, Src , Dest , 1) . (1) 2. Conversely, if αDest,j = 1/|Rcvd Dest|, then there exists a unique (Msg, Src , Dest ) Relevant Dest such that (1) is satisfied. 3. If at least one message is received, then the minimal nonzero value of αDest is 1/|Rcvd Dest|. We design ψDest to uniquely identify Rcvd Dest from YDest as follows. If at least one message is received, then 1/|Rcvd Dest| can be identified by finding the smallest nonzero value of αDest. The decoder ψ inspects every YDest,j satisfying αDest,j = 1/|Rcvd Dest|, which therefore satisfies |Rcvd Dest| (Msg Dest,j, Src Dest,j, Dest Dest,j) Relevant Dest. Thus, if |Rcvd Dest| Dest Dest,j = Dest, then |Rcvd Dest| (Msg Dest,j, Src Dest,j) Rcvd Dest, and ψ encodes it as such. Fact B.2. For any n, b n, and d 12b2 ln n , there exists a binary matrix A {0, 1}n d such that, for every subset S [n] with |S| b, the columns of the sub-matrix AS {0, 1}|S| d contains all S-dimensional elementary vectors, i.e., e1, . . . , e|S| is a subset of the columns of AS. Proof. Let col(A) denote the set of columns of A. We use the probabilistic method and consider A with iid entries Ai,j Bernoulli( 1 b+1). We bound the probability of failure: s.t. e1, . . . , e|S| col(AS) b nb Pr [ei col(AS)] 1 1 b + 1 1 1 b + 1 nb+1 1 1 e(b + 1) nb+1 exp d e(b + 1) < exp (b + 1) ln n d 3(b + 1) Therefore, there exists a matrix A with the claimed property. B.2 Proof of Theorem 3.1 We give a generalization of Theorem 3.1 that simulates a broader family of MPC protocol, including those with more than n machines (i.e. γ δ). We accommodate this generalization by simulating MPC protocols with the generalized transformer family Transformer N,M m,L,H detailed in Appendix A with supplemental blank chain-of-thought tokens. Theorem B.3 (Generalization of Theorem 3.1). For constant γ, δ > 0 and any potentially randomized R-round (γ, δ)-MPC protocol π on nin input words and nout nin output words, there exists a transformer T Transformer N,M m,L,H with N = nin, M = max(nin, O(n1+γ δ in )), m = O(n4δ in log nin), L = R + 1, H = O(log log nin) such that T(Input):nout = π(Input). Transformers, Parallel Computation, and Logarithmic Depth Attention head Attention head Sparse propagation Softmax attention Attention head Multiple hashing Figure 4. 
To simulate MPC, the local computation within each machine is pushed inside Q( ), K( ), V ( ), and then the pairwise attention matrix performs message routing. To ensure proper routing and also that the outputs of Q( ), K( ), V ( ) are all tall-and-skinny matrices, the construction carefully utilizes both multiple hashing and sparse propagation. Theorem 3.1 is an immediate consequence of Theorem B.3 by noting that M = N for sufficiently large nin when γ < δ. Its central construction is summarized in Figure 4. Proof. Consider any MPC protocol π with q = O(n1+γ δ in ) machines and s = O(nδ in) local memory that, following the notation of Definition 2.3, maps Input Znin 2p to Output Znout 2p with intermediates Machine In(1), . . . Machine In(R) and Machine Out(1), . . . , Machine Out(R) and deterministic functions (Localr,i)r [R],i [q] with Machine Out(r) i = Localr,i(Machine In(r) i ). To simulate the protocol, we let every machine i [q] correspond to a particular position in the transformer s context. A transformer that simulates π can then be constructed that consolidates Input onto nin/s machines to match Machine In(1); computes Machine In(r+1) from Machine In(r) for each r = 1, . . . , R 1; and computes and properly distributes Output from Machine In(r). These three elements of the construction exist due to the following lemmas, which are proved later. Lemma B.4. For any MPC protocol π with local memory s and q machines with nin-word inputs, there exists a transformer init Transformernin,max(nin,q) s,1,1,din,dout with din = 1 and dout = s, which, given Input Zn 2p, has output satisfying init(Input) = Machine In(1). Lemma B.5. For any R-round MPC protocol π with local memory s and q machines and any r [R 1], there exists a transformer round(r) Transformerq m,1,H,din,dout with H = O(log log q), m = O(s4 log q), and din = dout = s which, given any valid input X = Machine In(r) Zq m 2p under the MPC protocol in vectorized form, has output satisfying round(r)(X) = Machine In(r+1). Lemma B.6. For any R-round MPC protocol π with local memory s and q machines with nout-word output, there exists a transformer final Transformerq,max(nout,q) s,1,1,din,dout for din = s and dout = 1, which, given input X = Machine In(R), has output final(X) with final(X)i,1 = Outputi Z2p. Transformers, Parallel Computation, and Logarithmic Depth The proof immediate from the three lemmas. We construct the final transformer T by stacking the single-layer constructions as a single transformer with embedding dimension m: T = final round(R 1) round(1) init. The proofs of Lemmas B.4 and B.6 rely on simple constructions with fixed attention matrices and appear in Appendix F. The proof of Lemma B.5 relies on Lemma 3.2 and is proved in the following section. Proof of round(r) Construction. To prove the existence single-layer transformer that simulates round(r), we separate the computational task into two steps: (i) obtaining Machine Out(r) from Machine In(r) and (ii) obtaining Machine In(r+1) from Machine Out(r). Because the former requires no communication between machines, we can encode that conversion in the input MLP to the transformer. The nontrivial part of the reduction is thus the latter step, which we obtain by utilizing multiple single-headed attention units routeβ,s of Lemma 3.2 to route messages of different sizes to their recipients. 
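For orientation, here is a minimal plain-Python sketch of the protocol structure that the stacked layers init, round^(1), ..., round^(R-1), final simulate. It captures only the local-computation-then-routing semantics of a round, not the attention construction; all names and the toy usage are illustrative.

```python
def simulate_protocol(protocol_input, locals_by_round, s, q):
    """Plain reference for the structure simulated by Theorem B.3.

    locals_by_round[r](i, machine_in_i) returns machine i's outgoing messages
    in round r as (dest, payload) pairs.  Phases mirror init / round^(r) / final.
    """
    # init (Lemma B.4): consolidate the input words onto the first machines.
    machine_in = [[] for _ in range(q)]
    for i, word in enumerate(protocol_input):
        machine_in[i // s].append((i, word))

    for local_r in locals_by_round:
        # Local_r: purely element-wise computation inside each machine.
        outgoing = [local_r(i, machine_in[i]) for i in range(q)]
        # route^(r) (Lemma 3.2 / B.7): deliver every message to its destination.
        machine_in = [[] for _ in range(q)]
        for src, msgs in enumerate(outgoing):
            for dest, payload in msgs:
                machine_in[dest].append((src, payload))

    # final (Lemma B.6): the output is read off from the machines in order.
    return machine_in

# One-round example: every machine forwards its words to the next machine.
out = simulate_protocol(list(range(8)),
                        [lambda i, mi: [((i + 1) % 4, w) for _, w in mi]],
                        s=2, q=4)
print(out[1])   # the two words originally held by machine 0
```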
The difficulty in this task is the mismatch in functionality between the two computational models: while the MPC model ensures that each recipient automatically receives its intended messages, transformers must implement this functionality manually, while ensuring that multiple messages do not overwrite one another. The following lemma implements that routing functionality for all messages, using different attention heads depending on the size of the message. We prove Lemma B.5 at the end of the section as a simple modification of Lemma B.7. Lemma B.7. For any R-round MPC protocol π with local memory s and q machines and any r [R 1], there exists a transformer route(r) Transformerq m,1,H with H = O(log log q) and m = O(s4 log q), which, given any valid input X = Machine Out(r) Zq m 2p under the MPC protocol in vectorized form, has output satisfying route(r)(X) = Machine In(r+1). Because at most s messages can be shared and received by each machine, and each message is of size at most s, we can prove an single-headed alternative to Lemma B.7 with a somewhat suboptimal dependence on embedding dimension. By applying by Lemma 3.2 with message size β = s, bounded number of messages s, and context length N = q, there exists a transformer routes,s with H = 1 and m = O(s5 log q) that computes Machine In(r+1) from Machine Out(r+1) by regarding each outgoing message as belonging to Zs 2p by adding padding dimensions as needed. We improve the embedding dimension to m = O(s4 log q) by running in parallel O(log log N) transformers guaranteed by Lemma 3.2 that encode differently sized messages. The number of heads H increases at a doubly-logarithmic rate because of a doubling trick employed on the size of message encodings used by constituent part. Proof. We describe an implementation of route(r) by considering any fixed input Machine Out(r) Zq m 2p . For each i [q] and some integer sequence 1 = β0 < β1 < < βH = s + 1, we partition Machine Out(r) i into H disjoint subsets as follows. For any h [H], let Senth i := n (Msg, Dest) Machine Out(r) i : dim(Msg) [βh 1, βh] o , Rcvdh i := n (Msg, Src) Machine In(r+1) i : dim(Msg) [βh 1, βh] o , and note that Machine Out(r) i = SH h=1Senth i and Machine In(r+1) i = SH h=1Rcvdh i . For each h [H], note that dim(Msg) βh, and Senth i = Rcvdh i s/βh 1. As a result, Lemma 3.2 guarantees the existence of a single-headed transformer route(r) h such that route(r) h (Senth) = Rcvdh) with embedding dimension mh Cs4βh log(q)/β4 h 1 for some sufficiently large universal constant C. We defined route(r) as the computation of route(r) 1 , . . . , route(r) H as H parallel heads of self-attention with disjoint embeddings concatenated into in m-dimensional embedding space with m = PH h=1 mh. We conclude by letting ( 1 if h = 0, min(2β3 h 1, q + 1) if h [H], Transformers, Parallel Computation, and Logarithmic Depth noting that βH = q + 1 for H = O(log log q), and bounding m: Cs4 log(q)βh β4 h 1 2Cs4 log(q) 2Cs4 log(q) 1 2h 1 = O(s4 log q). Proof of Lemma B.5. To simulate a round of MPC protocol π by mapping Machine In(r) and ρr to Machine In(r+1), the single-layer transformer round(r) first computes Machine Out(r) element-wise and then properly routes messages in Machine Out(r) to their proper destination. We can define round(r) = route(r) Localr for route(r) in Lemma B.7 and Localr,i(Machine In(r) i , ρr,i) = Machine Out(r) i . 
This can be immediately constructed as a single-layer transformer by prepending the embeddings Q, K, V of the construction of route(r) with Localr, using Q Localr, K Localr, V Localr as the embeddings of round(r). B.3 Additional Graph Problems Solvable by Log-Depth Transformers Theorem 8.1 and Corollary 8.2 of Coy & Czumaj (2022) give efficient MPC protocols for other graph problems besides connectivity, and therefore, as corollaries of Theorem 3.1, we also obtain log-depth transformers for these problems. Corollary B.8 (Spanning forest construction). For any constant ϵ (0, 1) and any D N, there exists a transformer in Transformer N m,L,H with m = O(N ϵ), H = O(log log N), and L = O(log D) that computes a rooted spanning forest of any input graph G = (V, E) with |V |, |E| = O(N) where each connected component has diameter at most D. Corollary B.9 (Minimum spanning forest construction). For any constant ϵ (0, 1) and any DMSF N, there exists a transformer in Transformer N m,L,H with m = O(N ϵ), H = O(log log N), and L = O(log DMSF ) that identifies the connected components of any input graph G = (V, E) with |V |, |E| = O(N) and poly(N)-bounded integer weights whose minimum spanning forest has diameter at most DMSF . C Proofs from Section 3.2 C.1 Proof of Theorem 3.4 As in Appendix B.2, we give and prove a generalized version of Theorem 3.4 that broadens the family of considered transformers to include masked models and those that contain extra blank chain-of-thought tokens, using notation from Appendix A. Theorem C.1 (Generalization of Theorem 3.4). For any transformer T Transformer N,M m,L,H (or Mask Transformer N,M m,L,H) with m H = O(N δ) for δ (0, 1) and M = Θ(N 1+α) for α 0 and for any δ (δ, 1), there exists an O( L(1+α) δ δ )-round (1 + 2α + δ , δ )-MPC protocol with q = O(M 2) machines with s = O(N δ ) local memory that outputs the same sequence as T(X) for all X RN. Theorem 3.4 is an immediate consequence by setting M := N and α := 0. Proof. It suffices to show that an O( 1+α δ δ)-round MPC protocol π that simulates a single-layer transformer T Transformer M m,m,m,1,H with m-dimensional input and output embeddings since a depth-L transformer can be constructed by applying L such protocols sequentially. Moreover, we can ignore the difference between the input context length N and the context length with padding M by assuming that the input contains M tokens. Concretely, we consider H heads with embeddings (Qh, Kh, Vh)h [H], element-wise output MLP ψ = (ψ1, . . . , ψM), and any fixed masks Λ1, . . . , ΛH { , 0}M M. We show that there exists some π such that for any Input = X RM m, h=1 softmax(Qh(X)Kh(X)T + Λh)Vh(X) Transformers, Parallel Computation, and Logarithmic Depth (Qh,i, Kh,i, Vh,i) Token machines Key/value propagation machines Query propagation machines Inner product machines Figure 5. This construction employs M 2 inner product machines to compute the entries of the softmax matrix, and M token machines to compute all values of Q( ), K( ), V ( ). What is most complex about the construction are the additional machines and message routing needed to propagate these values efficiently between the inner product machines and the token machines, in particular carefully aggregating the output of the attention mechanism and computing its normalization. 
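For reference, the following is a minimal NumPy sketch of the single-layer quantity displayed above that the protocol must reproduce: masked multi-head softmax attention followed by the element-wise output MLP ψ. The function names, the toy usage, and the stabilizing max-shift (absent from the exact sums analyzed in the proof) are our additions.

```python
import numpy as np

def masked_multihead_attention(X, Q_fns, K_fns, V_fns, masks, psi):
    """psi applied row-wise to sum_h softmax(Q_h(X) K_h(X)^T + Lambda_h) V_h(X).

    Q_fns/K_fns/V_fns: lists of H callables mapping (M, m) -> (M, m).
    masks: H additive masks with entries in {0, -inf}.
    """
    acc = None
    for Qh, Kh, Vh, Lam in zip(Q_fns, K_fns, V_fns, masks):
        logits = Qh(X) @ Kh(X).T + Lam                # (M, M) attention scores
        logits -= logits.max(axis=1, keepdims=True)   # numerical stabilization
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)
        head_out = weights @ Vh(X)                    # (M, m)
        acc = head_out if acc is None else acc + head_out
    return np.apply_along_axis(psi, 1, acc)           # element-wise output MLP

# Toy usage with identity embeddings, a causal mask, and an identity psi.
M, m, H = 6, 4, 2
X = np.random.randn(M, m)
ident = lambda Z: Z
causal = np.triu(np.full((M, M), -np.inf), k=1)       # {0, -inf} mask
out = masked_multihead_attention(X, [ident] * H, [ident] * H, [ident] * H,
                                 [causal] * H, lambda v: v)
print(out.shape)                                       # (6, 4)
```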
To this end, the protocol uses additional machines, organized into a tree with branching factor b = O(N δ δ) and depth D = O( 1+α where numbers in X and all intermediate products of the transformer computation can be represented with p = O(log M) bit precision. Our MPC protocol π, which will use q = O(M 2) machines and s = Θ(N δ ) words of local memory per machine, assigns each of the q machines to one of four possible roles: token machine, inner product machine, query propagation machine, and key/value propagation machine. We describe these machines below. For the sake of readability, we identify machines with easily interpretable descriptions and use the bijection ID to map each of those to a token in [q] that is used for routing messages. Our protocol has two important parameters: b = s/(4m H) = O(N δ δ) is the branching factor of the protocol, and D = logb(M) = O( 1+α δ δ) is the depth of the protocol. At a high level (see Figure 5 for a corresponding diagram), the protocol involves computing all intermediate products of the of a transformer unit by performing MLP computations in N token machines, computing inner products in N 2 inner product machines, and using O(N 2) other propagation machines arranged in trees to share information between the two in O(D) rounds. The protocol draws inspiration from Appendix C.6.1 of Sanford et al. (2023), which uses a similar construction to simulate transformers with CONGEST protocols on fixed graphs. It is also similar to the MPC implementation of the MPI All Reduce functionality (MPICH, 2023) described by Agarwal et al. (2014). Machine i [M] is a token machine that performs all element-wise computation on the ith token embedding, including the computation of (Qh,i(Xi), Kh,i(Xi), Vh,i(Xi))h [H] and the final MLP output ψi. Let ID(i) = i. Machine (i, i ) [M]2 is an inner product machine designed to compute the inner products (Qh,i(Xi)TKh,i (Xi ))h [H]. Machine (Q, i, d, k) for token i [M], depth d [D 1] and position k [bd] is a query propagation machine. This machine is responsible for handling communication of query tokens (Qh,i(Xi))h [H] and of all partially-computed attention outputs for the ith token between token machine i and inner product machines (i, i ) for i Descendantsd,k := b D d(k 1), . . . , b D dk [M]. Transformers, Parallel Computation, and Logarithmic Depth Concretely, if ℓ= 1, then the machine communicates with token machine i and query propagation machines (Q, i, d + 1, k ) for k Childrenk := {b(k 1) + 1, . . . , bk} . If ℓ= D 1, then it communicates with inner product machines (i, i ) for i Childrenk [M] and query propagation machine (Q, i, d 1, k/b ). Otherwise, it communicates with query propagation machines (Q, i, d 1, Parentk), for Parentk := k/b , and (Q, i, d + 1, k ) for k Childrenk. Machine (KV, i, d, k) is a key/value propagation machine. This machine is analogous to a query propagation machine, except that it is responsible for the communication of key and value tokens (Qh,i(Xi), Vh,i(Xi))h [H] between token machine i and inner product machines (i, i ) for i Descendantsd,k. Since the total number of machines is q = M + M 2 + M PD 1 d=1 bd = O(M 2), we conclude that the global memory of the protocol is qs = O(N 2+2α+δ ), which means the protocol is (1 + 2α + δ , δ )-MPC. We simulate the transformer using a four stage protocol using 2D + 3 = O( 1+α δ δ) rounds of MPC computation. Stage 1: Token Dispersion. Because the input to an MPC protocol Input = X is divided equally among machines 1, . . . 
, Mm H/s , the first round of MPC computation routes each input token Xi to its respective token machine. This is completed by setting (i, Xi) Machine Out(1) i if (i, Xi) Machine In(1) i . Thus, Machine In(2) i = {(Src, Xi)} for all token machines i [M]. Stage 2: Embedding Propagation. In rounds 2, . . . , D + 1, π computes the respective key, query, and value embeddings in each token machine and propagate them to respective inner product machines using the query and key/value propagation machines. Concretely: In round 2, each token machine i (whose memory contains Xi) computes m-dimensional embeddings embeddings Qi := (Qh,i(Xi))h [H], Ki := (Kh,i(Xi))h [H], Vi := (Vh,i(Xi))h [H]. It transmits each embedding to the respective depth-1 query and key/value propagation machine nodes, while also preserving knowledge of its own Xi. (In all further rounds, we assume that ((i, Xi)) Machine Out(r) i to ensure that token machine i can compute the skip-level connection at the end.) That is, Machine Out(2) i = {(i, Xi)} {(ID(Q, i, 1, k ), Qi) : k Children1} {(ID(KV, i, 1, k ), (Ki, Vi)) : k Children1} . Note that the total amount of messages sent is b m H + 2b m H + m s and that the only machines receiving messages are size m-messages by token machines and size 4m H messages by query and key/value propagation machines. In rounds r {3, . . . , D}, each query and key/value propagation machine of depth d = r 2 passes embeddings onto their successors. That is, Machine Out(r) ID(Q,i,d,k) = {(ID(Q, i, d + 1, k ), Qi) : k Childrenk} , Machine Out(r) ID(KV,i,d,k) = {(ID(KV, i, d + 1, k ), (Ki, Vi)) : k Childrenk} . In round D + 1, the depth-(D 1) query and key/value propagation machines pass their embeddings onto their respective inner product machines. That is, Machine Out(D+1) ID(Q,i,D 1,k) = {(ID(i, k ), Qi) : k Childrenk [M]} , Machine Out(D+1) ID(KV,i,D 1,k) = {(ID(k , i), (Ki, Vi)) : k k Childrenk [M]} . Transformers, Parallel Computation, and Logarithmic Depth Stage 3: Softmax Computation. In rounds D + 2, . . . , 2D + 2, computes each inner product and iteratively builds up each attention output by accumulating partial softmax computations. For each query propagation machine (Q, i, d, k) and h [H], we let Si,d,k,h and Zi,d,k,h denote its partial normalization and softmax computations respectively. That is, Zi,d,k,h = X i Descendantsd,k exp(Qh,i(Xi)TKh,i (Xi ))1 {Λi,i = 0} k Childrenk Zi,d+1,k ,h if d D 1, exp(Qh,i(Xi)TKh,k(Xk))1 {Λi,k = 0} if d = D. Si,d,k,h = 1 Zi,d,k,h i Descendantsd,k exp(Qh,i(Xi)TKh,i (Xi ))Vh,i (Xi )1 {Λi,i = 0} k Childrenk Zi,d+1,k ,h Zi,d,k,h Si,d+1,k ,h if d D 1, Vh,k(Xk)1 {Λi,k = 0} if d = D; Note that Si,0,1,h = (softmax(Qh(X)Kh(X)T + Λh)Vh(X))i and let Si,d,k = (Si,d,k,h)h [H] RH m and Zi,d,k = (Zi,d,k,h)h [H] RH In round D + 2, each inner product machine computes its respective inner products and passes its partial softmax computations to its parent query propagation machine. As a result of round D + 1, each inner product machine (i, i ) recently received the embeddings necessary to compute the relevant inner product: Machine In(d+2) ID(i,i ) = {(ID(Q, i, D 1, Parenti), Qi), (ID(KV, i , D 1, Parenti ), (Ki , Vi ))} . It propagates the respective partial computations Si,D,i and Zi,D,i as follows: Machine Out(D+2) ID(i,i ) = {(ID(Q, i, D 1, Parenti), (Si,D,i , Zi,D,i ))} . Note that each depth-(D 1) query propagation machine receives messages of size at most b (m + 1)H s. In rounds r {D + 3, . . . 
, 2D}, partial softmax computations are received by query propagation machines of depth d = 2D + 1 r, added together, and passed along to their parent machines. That is, given Machine In(r) ID(Q,i,d,k) = {(ID(Q, i, d + 1, k ), (Si,d+1,k , Zi,d+1,k )) : k Childrenk} , each respective machine computes Si,d,k and Zi,d,k recursively and propagates Machine Out(r) ID(Q,i,d,k) = {(ID(Q, i, d 1, Parentk), (Si,d,k, Zi,d,k)} . In round 2D + 1, the top-most query propagation tokens pass their partial sums to the token machines: Machine Out(2D+1) ID(Q,i,1,k) = {(i, (Si,1,k, Zi,1,k))} . In round 2D + 2, the token machines compute their respective output of the transformer, T(X)i. Given input Machine In(2D+2) i = {(k , (Si,1,k , Zi,1,k )) : k Children1} {(i, Xi)} , the token machine i computes Si,0,1 and Hi,0,1 and then h=1 softmax(Qh(X)Kh(X)T + Λh)T i Vh(X) h=1 Si,0,1,h This quantity is used as an intermediate product for the final phase of computation. Transformers, Parallel Computation, and Logarithmic Depth Stage 4: Token Compression. We invert Stage 1 by properly compressing the MPC output in the final round 2D + 3. That is, we let Machine Out(2D+2) i = {( im H/s + 1, T(X)i)} for each token machine i [M], which ensures that the outputs are condensed in the proper order in machines 1, . . . , Mm H/s . Precision Analysis. In order for the proof to be fully sound, care must be taken to ensure that the computation of each self-attention output Si,0,1,h is handled with proper numeric precision, as discussed in Appendix A. We show that each Si,0,1,h is a valid implementation of its corresponding self-attention unit, per Definition A.1. To do so, we let ˆSi,d,k,h and ˆZi,d,k,h denote the p-bit representations of Si,d,k,h and Zi,d,k,h, where scalars of ˆSi,d,k,h and log( ˆZi,d,k,h) are represented as discretized rational numbers z satisfying |z| 1 22p/2 and z 2p/2 Z. For some sufficiently small p = Θ(p), we assume that all embeddings Qh(X), Kh(X), Vh(X) have scalars z satisfying |z| 1 and z 2p /2 Z. We prove that for each h [H], Si,0,1,h ˆSi,d,k,h = O 1 Boundedness of intermediate representations is not an issue because log(Zi,d,k,h) O(log(N) + max i,i |Q(X)T i K(X)i |) = exp(O(p )), and Si,d,k,h V (X) 2p /2. It remains to show that that all intermediate representations are sufficiently close to their exact counterparts. We prove the following via an inductive argument for d = D, D 1, . . . , 0: log(Zi,d,k,h) log( ˆZi,d,k,h) (2b)D d Si,d,k,h ˆSi,d,k,h 2p /2(8b)D d If (3) holds for d = 0, then the claim holds for sufficiently large p = Θ(p ). For the base case D, we verify (3) by Si,D,k,h ˆSi,D,k,h = Vh,k(Xk)1 {Λi,k = 0} ˆSi,D,k,h 1 2p/2 , due to the ability to access Vh,k(Xk) and round it directly. We verify (2) due to the immediate access to and boundedness of Qh,i(Xi)TKh,k(Xk): |log(Zi,d,k,h)| Qh,i(Xi)TKh,k(Xk) Qh,i(Xi) 2 Kh,k(Xk) 2 N 2p /2. We prove the inductive step for d 1, assuming that the inductive hypothesis holds for d. We first address ˆZi,d 1,k,h by employing the Lipschitzness of the log-sum-exp function. log(Zi,d 1,k,h) log( ˆZi,d 1,k,h) 1 2p/2 + k exp(log(Zi,d,k ,h)) k exp(log( ˆZi,d,k ,h)) log(Zi,d,k ,h) log( ˆZi,d,k ,h) 1 2p/2 + b (2b)D d 2p/2 (2b)D d+1 Transformers, Parallel Computation, and Logarithmic Depth To obtain (3) for d 1, we first note that for sufficiently large p: 1 ˆZi,d,k ,h Zi,d 1,k ,h Zi,d,k,h ˆZi,d 1,k ,h ˆZi,d 1,k,h log ˆZi,d,k ,h Zi,d,k ,h log Zi,d 1,k,h ˆZi,d 1,k,h 4 (2b)D d+1 We conclude by using the fact that each Si,d 1,k,h is a convex combination of other Si,d,k,h. 
Si,d 1,k,h ˆSi,d 1,k,h 1 2p/2 + X Zi,d,k ,h Zi,d 1,k ,h Si,d,k ,h ˆZi,d,k ,h ˆZi,d 1,k ,h ˆSi,d,k ,h Zi,d,k ,h Zi,d 1,k ,h Si,d,k ,h ˆZi,d,k ,h Zi,d 1,k ,h Zi,d,k,h ˆZi,d 1,k ,h ˆSi,d,k ,h Zi,d,k ,h Zi,d 1,k ,h Si,d,k ,h ˆSi,d,k ,h Zi,d,k ,h Zi,d 1,k ,h 1 ˆZi,d,k ,h Zi,d 1,k ,h Zi,d,k,h ˆZi,d 1,k ,h 1 2p/2 + 2p /2(8b)D d 2p/2 + 2p /2 X Zi,d,k ,h Zi,d 1,k ,h 1 ˆZi,d,k ,h Zi,d 1,k ,h Zi,d,k,h ˆZi,d 1,k ,h 2 2p /2(8b)D d 2p/2 + 2p /2 4 (2b)D d+1 2p/2 2p /2(8b)D d+1 Owing to the fact that D and p are constants and b = N O(1), a sufficiently large choice of p guarantees that the implementation is valid. C.2 Proof of Corollary 3.5 Corollary 3.5. Let ϵ (0, 1) be any constant, and let D N ϵ. Assume Conjecture 2.4, and suppose there exists T Transformer N m,L,H with m H = O(D1 ϵ) that decides connectivity of any input graph with connected components having diameter D. Then L = Ω(log D). We prove Corollary 3.5 by combining Theorem C.1 and Conjecture 2.4. Proof. Fix any D N with D N ξ for some ξ (0, 1]. Let C1 denote a cycle graph on D vertices, and let C2 denote the union of two cycle graphs each with D/2 vertices. Suppose there is a transformer T Transformer N m,L,H with m H = O(D1 ϵ) that determines the connectivity of graphs with at most N edges and connected components with diameter at most D. We will show that it can be used to design an Θ(L)-round MPC protocol π that distinguishes graphs C1 and C2 with n = D edges. Let π be an MPC protocol that exactly computes the output of T using taking R = O(L) rounds with local memory s = O(D1 ϵ/2) and q = O(N 2) machines, which is guaranteed to exist by Theorem C.1. Let n := 2 D 4 and k := N n . We design π with the same local memory and machine count to determine the identity of input graph G = (V, E) {C1, C2} provided as an arbitrary sequence of n edges. Let u V be an arbitrary vertex in G. Using a constant number of MPC rounds, π converts G into a graph G = (V , E ) with |E | = kn + k N and diameter n + 2 D such that G is connected if and only if G = C1. We do so by letting G be composed of k copies G1, . . . , Gk of G on separate vertices, along with k extra edges connecting the vertex corresponding to u in each Gj (say uj Gj) to Transformers, Parallel Computation, and Logarithmic Depth u1 G1. This ensures that the connectivity correspondence and edge count diameter bounds are met. Since G can be produced by simply copying edges from G and adding an additional edge each time an edge containing u is copied, π can produce G in O(1) rounds. Then, π simulates π on G and returns its output. Since G is connected if and only if G = C1, this protocol suffices to distinguish C1 and C2. Because the protocol uses s = O(n1 ϵ/2) local memory and q = O(n2/ξ) machines, Conjecture 2.4 implies that π (and hence T) only exists if L = Ω(log n) = Ω(log N). D Proofs from Section 4.1 D.1 Proof of Theorem 4.2 Theorem 4.2. For any k N and alphabet Σ with |Σ| N, there exists T Mask Transformer N m,L,H that computes hopk : ΣN (Σ { })N with m = O(1), L = log2 k + 2, and H = 1. Proof. We design a masked transformer that implements hopk in two phases. The first two layers compute find1 X(i) for each i [N] using a similar approach to the induction heads construction of (Bietti et al., 2023). The subsequent layers employ a doubling trick to compute each find2ℓ 2 X (i) after ℓlayers. To do so we employ two technical lemmas (which are proved in Appendix F.4) that describe the implementation of masked self-attention units that copy . Lemma D.1. 
For some m d + 2, τ : [N] Rm [N], and ρ : Rm Rd, there exists an attention head look Upτ,ρ Mask Attn N m with precision p = O(log N) and m d + 2 satisfying look Upτ,ρ(X)i,:d = ρ(Xτ(i,Xi)). Lemma D.2. For any finite alphabet Σ, m d + 2, µ1, µ2 : Rm Σ, and ρ : Rm Rd, there exists an attention head last Occurrenceµ,ρ Mask Attn N m with precision p = O(log(N |Σ|)) such that, last Occurrence(X)i,:d = ( ρ( 0) if i < i : µ1(Xi ) = µ2(Xi), ρ(Xi ) if i = max {i < i : µ1(Xi ) = µ2(Xi)} . The first layer obtains the previous token Xi 1 from each Xi. This is accomplished via the self-attention head look Upτ,ρ with τ(i, Xi) = i 1 and ρ(Xi) = Xi. The second layer retrieves (find1 X(i), Xfind1 X(i)) for each i [N] by finding the most recent token whose preceding token is Xi. It does so by employing the last Occurrenceµ1,µ2,ρ primitive on the intermediate state X1 i = (Xi, Xi 1) with µ1(X1 i ) = Xi 1, µ2(X1 i ) = Xi, and ρ(X1 i ) = (i, Xi). If find1 X(i) > 0, then last Occurrenceµ1,µ2,ρ(X1 i ) = (find1 X(i), Xfind1 X(i)). Otherwise, it obtains 0 and performs no further passing, returning after all L layers. If k = 1, the transformer returns T(X)i = Xfind1 X(i) = hopk(X)i. Otherwise, let k := P log2 k j=0 kj2j for some kj {0, 1}, and let k:ℓ= Pℓ j=0 kj2j. Construct a transformer inductively to ensure that the ith output of the ℓth layer Xℓ i Rm for ℓ 2 contains an encoding of Xi, find2ℓ 2 X (i), Xfind2ℓ 2 X (i), findk:ℓ 2 X (i), Xfind k:ℓ 2 X (i) Note that the base case holds for ℓ= 2, since findk:0 X (0) = find1 X(0) if k0 = 0 and is i otherwise. For each ℓ= 1, . . . , log2 k + 1, we assume that the inductive hypothesis holds up to layer ℓand prove that it also holds for layer ℓ+ 1. To do so, we use a look Upτ,ρ self-attention head with τ(i, Xℓ i ) = find2ℓ 2 X (i) and ρ(Xℓ i ) = (find2ℓ 2 X (i), Xfind2ℓ 2 X (i), findk:ℓ 2 X (i), Xfind k:ℓ 2 X (i)), Transformers, Parallel Computation, and Logarithmic Depth which ensures that Xℓ+1 i can encode find2ℓ 1 X (i) = find2ℓ 2 X (find2ℓ 2 X (i)) Xfind2ℓ 1 X (i) = Xfind2ℓ 2 X (find2ℓ 2 X (i)) findk:ℓ 1 X (i) = ( findk:ℓ 2 X (find2ℓ 2 X (i)) if kℓ 1 = 1 findk:ℓ 2 X (i) if kℓ 1 = 0 Xfind k:ℓ 1 X (i) = Xfind k:ℓ 2 X (find2ℓ 2 X (i)) if kℓ 1 = 1 Xfind k:ℓ 2 X (i) if kℓ 1 = 0. As a result, the output of layer L = log2 k + 2 contains an encoding of Xfind k:L 2 X (i) = Xfindk X(i) = hopk(X)i for each i [N]. This is returned as the output of T(X). D.2 Proof of Corollary 4.3 Corollary 4.3. Assuming Conjecture 2.4, for any constants ξ (0, 1/2] and ϵ (0, 1), and any even k = Θ(N ξ), every transformer T Mask Transformer N m,L,H with m H = O(k1 ϵ) that computes hopk has depth L = Ω(log k). Proof. The proof is analogous to that of Corollary 3.5. Let C1 be a cycle on k vertices, and C2 be the union of two cycles each on k/2 vertices. So both C1 and C2 have k edges. We show that the existence of T Transformer N m,L,H with m H = O(k1 ϵ) such that T(X) = hopk(X) can be used to design an Θ(L)-round MPC protocol π to solve the task. As a result of Theorem C.1, there exists an MPC protocol π that exactly computes T with R = Θ(L) rounds with local memory s = O(D1 ϵ/2) and q = O(N 2) machines. On input G = (V, E) {C1, C2}, we design a constant-round protocol that computes an sequence X ΣN such that hopk(X)N exactly determines the identity of G. Since the k edges are passed to π in an unknown ordering with unknown labelings, we let V = [k] and denote the edges as e1 = {u1, v1} , . . . , ek = {uk, vk}. 
We define an operator next over the domain {(u, v), (v, u) : {u, v} E} as follows: for {u, v} E, let next(u, v) := (v , u) where v V is the unique vertex v = v such that {u, v } E. Notice that next is well-defined because all vertices in a cycle have degree 2. If G = C2, then nextk/2(ui, vi) = (ui, vi) for any i [k]. To set up our encoding of G as a sequence X, we first construct a gadget for each edge ei that will be used to compute a single next(ui, vi). Under the alphabet Σ = [k] { , , _}, we define the nine-token sequence ei = ui vi ui vi _. This gadget ensures that two hops will swap the values of ui and vi. That is find2 ei ui(10) = find1 ei ui(6) = 4, Xfind2 ei ui(10) = vi, find2 ei vi(10) = find1 ei vi(8) = 2, Xfind2 ei vi(10) = ui. Likewise, concatenating sequences corresponding to overlapping edges facilitates multiple hops. For example, if e1 = (1, 2), e2 = (3, 4), e3 = (2, 3), then find2 e1 e2 e3 2(28) = 22, Xfind2 e1 e2 e3 2(28) = 3, find4 e1 e2 e3 2(28) = 13, Xfind4 e1 e2 e3 2(28) = 4, find4 e1 e2 e3 3(28) = 2, Xfind4 e1 e2 e3 3(28) = 1. Transformers, Parallel Computation, and Logarithmic Depth Let E := (e1 e2 ek)k/2 1 be a length Nk := 9k k 2 + 1 sequence and let X = (_)N Nk E. We show that hopk(X)N = hopk(E)Nk = 1 if and only if G = C2. Without loss of generality, let {j, j + 1} = eij E for all j [ k 2 1]. Let ei0 = {1, v }, where v = k 2 if G = C2 and v = k if G = C1. Assume without loss of generality that i1 > i0. We argue inductively that for any j [ k 1. Every two hops simulates a single step of next: hop2j(E)Nk = nextj(1, v )1 = ( j if j + 1 < k 2 or G = C1, 1 if j = k 2. Every two hops never jumps by more than one repetition of all edges gadgets: find2j E (Nk) find2j 2 E (Nk) 9(k 1); 3. The executed gadget corresponds to the correct edge and the gadget is executed correctly: find2j E (Nk) {9kj + 9ij + ι : j N, ι {2, 4}} . If all three conditions are met, then hopk(X)N = 1 if and only if G = C1 from condition 1. We first show that the base case holds for j = 1. Since i1 > i0, the second-last time 1 appears in the E is in the final encoding ei1. By the two-case analysis of the ei1 gadget, we validate that hop2(E)Nk = 2 and conditions (1) and (3) hold. Since ei1 cannot be the first edge encoding appearing in e1 e2 ek, owing to it following ei0), condition (2) is satisfied. Suppose that the inductive hypotheses holds up to j < k 2. Then, we argue that it holds for j + 1. Since hop2j(E)Nk = j + 1 (from condition (1)) and find2j E (Nk) resides at the left-most side of the gadget for eij (from condition (3)), the two subsequent find E iterations must occur in the gadget eij+1. Because find2j E (Nk) 9k(k j) (from condition (2)), all edges appear in the k gadgets to the left of find2j E (Nk), and all other edges (including eij+1) must occur before the next occurrence of eij. Thus, the two hops occur in the eij+1 gadget (within distance 9(k 1)) and results in a properly positioned find2j+2 E (Nk) with hop2j+2(E)Nk = nextj+1(1, v )1. Since an MPC protocol can convert G to X using a constant number of layers, and because π outputs T(X)N = 1 if and only if G = C1, we can construct a protocol of π by simulating π . Because the protocol π uses s = O(k1 ϵ/2) local memory and q = O(k2/ξ) machines, Conjecture 2.4 implies that the existence of T requires L = Ω(log k). 
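To make the doubling schedule behind Theorem 4.2 (and the matching Ω(log k) lower bound above) concrete, here is a minimal Python sketch comparing the naive k-fold composition of find^1_X with the squaring schedule, which needs only about log2(k) rounds of pointer updates. The sequence is 1-indexed via a dummy entry, the null output is None, and the exact boundary convention for find^1_X is an assumption of this sketch based on the description in the proof of Theorem 4.2; it is a semantics-level illustration, not the transformer construction.

```python
def find1(X, i):
    """Most recent position j <= i whose preceding token equals X[i]; 0 if none."""
    for j in range(i, 1, -1):
        if X[j - 1] == X[i]:
            return j
    return 0

def hop_naive(X, i, k):
    """k-fold composition of find1, i.e. hop_k(X)_i, with None as the null output."""
    for _ in range(k):
        i = find1(X, i)
        if i == 0:
            return None
    return X[i]

def hop_doubling(X, i, k):
    """Same value via the doubling schedule: square the pointer map find^(2^l)
    once per round and fold in the binary digits of k."""
    n = len(X) - 1                                   # X[0] is a dummy entry
    jump = [find1(X, t) for t in range(n + 1)]       # jump = find^(1)
    ptr, bits = i, k
    while bits:
        if bits & 1:                                 # this power of two is in k
            ptr = jump[ptr]
            if ptr == 0:
                return None
        jump = [jump[jump[t]] for t in range(n + 1)] # find^(2^(l+1)) = find^(2^l) o find^(2^l)
        bits >>= 1
    return X[ptr]

X = [None, 'a', 'b', 'a', 'c', 'b', 'a', 'c']        # tokens at positions 1..7
for k in range(0, 6):
    assert hop_naive(X, 7, k) == hop_doubling(X, 7, k)
```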
E Proofs from Section 5 E.1 Multi-Player Pointer Chasing Communication Complexity We introduce the multi-pass multi-player blackboard communication model studied by Guha & Mc Gregor (2009) and Assadi & N (2021) to prove lower bounds for multi-pass streaming algorithms. A protocol in this model specifies how k players, each possessing a portion of a shared input, can jointly compute a function on the input over the course of R rounds of communication. In each round, all players take turns to broadcast an s-bit message to all other players. We provide a formal definition of the model as described in Section 6 of Assadi & N (2021). Definition E.1. A k-player R-round s-space sequential blackboard communication protocol includes k players P1, . . . , Pk. On input Z that can be partitioned into (Z1, . . . , Zk), each player Pj is provided with its respective Zj. In each round, players communicate via a shared blackboard. That is, in round r and in order Pk, . . . , P1, each player Pj writes a message Πr j {0, 1}s on the blackboard (which can be viewed by all players) as a potentially randomized function of input Zj and all information on the blackboard. After the conclusion of R rounds, the final message ΠR 1 is the output of the protocol. Transformers, Parallel Computation, and Logarithmic Depth Assadi & N (2021) proves a lower bound on the round complexity necessary to solve the well-studied multi-party pointer chasing problem of Nisan & Wigderson (1993). We present the problem as defined by Assadi & N (2021). Definition E.2. For q, k Z+, let an (q, k)-layered graph G = (V, E) have disjoint vertex layers V1, . . . , Vk+1 with V = V1 Vk+1 and each |Vj| = q and edge layers E1, . . . , Ek with E = E1 Ek and each Ej being a perfect matching between Vj and Vj+1. The pointer chasing task is provides a (q, k)-layered graph G, an arbitrary v V1, and an arbitrary equipartition V 1 k+1 and V 2 k+1 of Vk+1 as input and asks whether v is connected to a vertex in V 1 k+1 or V 2 k+1. Assadi & N (2021) give the following lower bound. Proposition E.3 (Proposition 4.12 of Assadi & N, 2021). Consider a k-player R-round s-space sequential blackboard protocol that solves the (q, k)-pointer chasing task where each player Pj is provided with the matching Ej and v and V 1 k+1, V 2 k+1 are globally known. Then, the protocol succeeds with probability at least 2 3 only if R k or s = Ω( q All of the lower bounds in Section 5 are most naturally proved by reducing from hopk, rather than pointer chasing. So we first prove a lower bound for hopk using the lower bound for pointer chasing from Proposition E.3. Proposition E.4. Consider a k-player R-round s-space sequential blackboard protocol that computes hopk(X)N on any X ΣN for Σ = [2q + 2] with q = N 2k where each player Pj is provided with Xj := (X2(k j)q+1, . . . , X2(k j+1)q), except for P1, who is given X1 := (X2(k 1)q+1, . . . , XN). Then, the protocol succeeds with probability at least 2 3 only if R k or s = Ω( N Proof. Assuming the existence of a k-player R-round s-space sequential blackboard protocol for hopk(X)N as described above, we design a protocol for solving (q, k)-pointer chasing with R rounds and s-size messages. The claimed lower bound will then follow by Proposition E.3. Consider any pointer chasing input with universally known V1, . . . , Vk+1, v V1, and V 1 k+1 and V 2 k+1, and each player Pj knowing matching Ej. We recursively define v1, . . . , vk+1 such that v1 = v and (vj, vj+1) Ej, noting that the output hinges on whether vk+1 V 1 k+1. 
Without loss of generality, let v = 1 and ( {1, . . . , q} if j is odd, {q + 1, . . . , 2q} if j is even. Each player independently determines their substring Xj of a input X to hopk before running the aforementioned protocol: Player P1 encodes X1 by letting XN = s = 1 and for any i 1, . . . , 2q, letting 2 V1 if i is odd, i V2 if i is even, ( i This ensures that that every integer in {1, . . . , 2q} appears exactly once in X1 1, . . . , X1 2q, which in turn guarantees that find1 X(N) = (k 1 + 1)q + 2 and that Xfind1 X(N) = v2 where (1, i ) E1. For any j {2, . . . , k 1}, player Pj encodes Ej as Xj as follows. If j is odd, then for every i {1, . . . , 2q}, 2 Vj if i is odd, i Vj+1 if i is even, ( i Alternatively, if j is even, 2 Vj if i is odd, i Vj+1 if i is even, (q + i Since every odd token corresponds to a vertex in Vj and each subsequent token corresponds to the vertex it s connected to by Ej, we can ensure that for every i [2q]: (X2(k j+1)+i, Xfind1 X(2(k j+1)+i)) Ej. Hence, it follows inductively that Xfindj X(N) = vj+1. Transformers, Parallel Computation, and Logarithmic Depth Player Pk encodes Xk if k is odd by letting Xk i = Xi = 2 Vk if i is odd, 2q + 1 if i is even, ( i 2, v) Ek, and v V 1 k+1, 2q + 2 if i is even, ( i 2, v) Ek, and v V 2 k+1. Likewise, if k is even, Xk i = Xi = 2 Vk if i is odd, 2q + 1 if i is even, ( i 2, v) Ek, and v V 1 k+1, 2q + 2 if i is even, ( i 2, v) Ek, and v V 2 k+1. These jointly ensure that hopk(X)N = Xfindk X(N) = ( 2q + 1 if vk+1 V 1 k+1, 2q + 2 if vk+1 V 2 k+1. Therefore, by formatting E1, . . . , Ek appropriately as X, running the protocol for hopk(X)N, and observing that the final output of player P 1 is 2q + 1 if and only if vk+1 V 1 k+1, there exists a k-player R-round s-space protocol for pointer chasing. Hence, by Proposition E.3, the protocol for hopk(X)N must use R k rounds or s = Ω( N k6 ) space. E.2 Proofs of Section 5.2 Corollary 5.2. A multi-layer RNN of depth L and width m as above with YN = hopk(X)N satisfies either L k or m = Ω( N Proof. Suppose there exists a multi-layer RNN computing output Y with YN,1 = hopk(X)N from input X with intermediate states Z1, . . . , ZL 1 and hidden states H1, . . . , HL. For any ℓ [L] and i i , note that Zℓ i , . . . , Zℓ i can be determined exactly from Hℓ i 1 and Zℓ 1 i , . . . , Zℓ 1 i . Given this RNN, we provide a multi-player blackboard communication protocol for solving hopk(X)N under the input model of Proposition E.4. In round r, we assume inductively that each player Pj knows Zℓ 1,j = (Zℓ 1 2(k j)q+1, . . . , Zℓ 1 2(k j+1)q), except for P1, who knows Zℓ 1,1 = (Zℓ 1 2(k 1)q+1, . . . , Zℓ 1 N ). In descending order, each player Pj computes Zℓ,j and Hℓ 2(k j+1)q writing the latter on the blackboard from Zℓ 1,j and Hℓ 2(k j)q,which was written on the blackboard by the previous player. Thus, player P 1 after round L knows and outputs ZL N,1 = YN,1 = hopk(X)N, which provides an L-round protocol m-space protocol. So the claimed lower bounds on width and depth follow from Proposition E.4. E.3 Proofs of Section 5.3 Corollary 5.3. Any T Kernel Former N m,m ,L,H with T(X)N = hopk(X)N satisfies either L k or mm Hp = Ω( N Proof. Under the distribution of input X = (X1, . . . , Xk) to players P1, . . . , Pk stipulated in the statement of Proposition E.4, we explain how the players can all compute the outcome of a single layer of H-headed kernelized attention in a single round of a blackboard protocol. It is immediate that a depth L network can be simulated in L rounds. 
On input X, consider H kernelized self-attention units with embeddings (Q 1, K 1, V1), . . . , (Q H, K H, VH) and output MLP ψ. Each player Pj immediately computes its embeddings (Q h(Xj), K h(Xj), Vh(Xj))h [H], followed by (K h(Xj)TVh(Xj)) Rm m for each h [H]. Because the object is to compute for each h ψ(Q h(X)K h(X)TVh(X)) = ψ(Q h(X) j=1 K h(Xj)TVh(Xj)), Transformers, Parallel Computation, and Logarithmic Depth each player writes their (K h(Xj)TVh(Xj))h [H] using message size s = Θ(mm Hp). Each can then construct K h(X)TVh(X)) by reading the board, and use it to compute its respective outputs without requiring supplemental communication. Hence, T (and thus hopk(X)N) can be simulated using an L-round blackboard protocol with message size s = Θ(mm Hp), and the corollary follows from Proposition E.4. Corollary 5.4. Any T Λw,g-Attn N m,L,H with T(X)N = hopk(X)N satisfies either L k or (w + N gk)m Hp = Ω( N Proof. As in the proof of Corollary 5.3, we explain how each player can compute their respective outputs of a single unit of self-attention masked by Λw,g. To compute the output corresponding to Xi, note that it is necessary to only know the embeddings corresponding to Xi w, Xi w+1, . . . , Xi+w and Xg, X2g, . . . , X N/g g. Thus, player Xj can compute the outputs of all of their inputs Xj = (X2(k j)q+1, . . . , X2(k j+1)q) given access to X2(k j)q+1 w, . . . , X2(k j)q, X2(k j+1)q+1, . . . , X2(k j+1)q+w, as well as Xg, X2g, . . . , X N/g g. Therefore, the protocol can be simulated if each player Xj writes inputs X2(k j)q+1, . . . , X2(k j)q+w, X2(k j+1)q w+1, . . . , X2(k j+1)q Rm, in addition to all Xi Xj such that i 0 (mod g). This can be accomplished by a protocol where each player writes s = O((w + N gk)mp) bits of information on the blackboard. By repeating this protocol in parallel for every head and sequentially for every layer, T and hopk(X)N can be simulated, and hence the claim follows from Proposition E.4. E.4 Proofs of Section 5.4 Corollary 5.6. Any T Mask Transformer N+NCo T m,1,H that computes hopk(X)N with NCo T tokens of chain-of-thought requires either NCo T k or m Hp = Ω( N Proof. We reduce to Proposition E.4. Consider some input X RN partitioned into X1, . . . , Xj as specified by the proof of Proposition E.4 with chain-of-thought XCo T and hopk(X)N determined by some masked transformer T.7 Suppose T has embeddings (Qh, Kh, Vh)h [H] and output MLP ψ. We provide an (NCo T + 1)-round blackboard protocol to compute hopk(X)N from X. Suppose in the rth round of the protocol, all players know XCo T,1, . . . , XCo T,r 1 and aim to compute T(X XCo T)N+r 1 = ( XCo T,r if r NCo T hopk(X)N if r = NCo T + 1 PN+r 1 i=1 exp(Qh N+r 1(XN+r 1)TKh i (Xi)T)V h i (Xi) PN+r 1 i=1 exp(Qh N+r 1(XN+r 1)TKh i (Xi)) 7We abuse notation to index XN+i = XCo T,i and let Xi Xj be true if i {2(k j)q + 1, . . . , w(k j + 1)q}. Transformers, Parallel Computation, and Logarithmic Depth Xi Xj exp(Qh N+r 1(XN+r 1)TKh i (Xi)T)V h i (Xi) Rm, Sr,h,Co T = i=N+1 exp(Qh N+r 1(XN+r 1)TKh i (Xi)T)V h i (Xi) Rm, Xi Xj exp(Qh N+r 1(XN+r 1)TKh i (Xi)T) R, Zr,h,Co T = i=N+1 exp(Qh N+r 1(XN+r 1)TKh i (Xi)T) R, then we observe that T(X XCo T)N+r 1 = ψN+r 1 Pk j=1 Sr,h,j + Sr,h,Co T Pk j=1 Zr,h,j + Zr,h,Co T Each player Pk computes (Sr,h,j, Zr,h,j)h [H] and writes them on the blackboard with O(m Hp)-bit messages. Since Sr,h,Co T and Zr,h,Co T are known by all players, every player can individually T(X XCo T)N+r 1. By induction, all players know hopk(X)N after NCo T + 1 rounds. 
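A minimal NumPy sketch of the partial-sum decomposition used here (and in Stage 3 of the proof of Theorem C.1): softmax attention for a single query is recovered exactly from per-chunk numerators S_j and denominators Z_j, which is what each player writes on the blackboard. The chunking and names are illustrative, and no max-shift is applied, matching the exact sums in the proofs; a practical implementation would stabilize the exponentials.

```python
import numpy as np

def attention_from_partial_sums(q, K, V, chunks):
    """softmax(q K^T) V for one query q, accumulated from per-chunk sums:
    S_j = sum_i exp(q . K_i) V_i and Z_j = sum_i exp(q . K_i)."""
    S = np.zeros(V.shape[1])
    Z = 0.0
    for idx in chunks:                 # one "player" per chunk of key/value rows
        w = np.exp(K[idx] @ q)         # local exponentials
        S += w @ V[idx]                # local numerator S_j
        Z += w.sum()                   # local denominator Z_j
    return S / Z

rng = np.random.default_rng(0)
N, m = 12, 4
q = rng.normal(size=m)
K, V = rng.normal(size=(N, m)), rng.normal(size=(N, m))
chunks = np.array_split(np.arange(N), 3)
direct = (np.exp(K @ q) / np.exp(K @ q).sum()) @ V
assert np.allclose(attention_from_partial_sums(q, K, V, chunks), direct)
```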
The claim now follows from Proposition E.4. F Proofs of low-level attention constructions F.1 Hardmax simulation proof of Appendix A.1 Lemma A.2. Let f Attn N m be a self-attention unit with precision p = Θ(log N) and embedding functions Q, K, V such that for some fixed 1 ξ = N O(1) and every X RN m and i [N]: A(X)i,i max i A(X)i,i ξ, i Imax(A(X)i), where A(X) = Q(X)K(X)T. Then there exists a self-attention unit f Attn N m with a valid p -bit implementation with p = O(p) satisfying f (X) = hardmax(A(X))V (X). Proof. For some p = Θ(p + log 1 ξ) and c = Θ( p +ζ ξ log N) where ζ is as in Appendix A.1), let f have query embedding Q (X) = c Q(X) and identical key K and value V embeddings as f. Therefore, by construction, these embeddings can be written with precision p = O(ln(c) + p) = O(log 1 ξ + log log N + p) = O(p). Let ˆf be a valid p -bit implementation of f , meaning that the two ˆf f = O(1/2p+1) (thus ˆf rounds f to p bits of precision), and fix some X. We first show that the softmax matrix is sufficiently close to that of the hardmax and is also a valid p -bit implementation of the hardmax. Without loss of generality, let 1 Imax(A(X)i). First, note that i Imax(A(X)i) exp(c A(X)i,i ) N exp(cξ) exp(c A(X)i,1) = 1 N O(p +ζ) exp(c A(X)i,1). |softmax(c A(X))i,1 hardmax(A(X))i,1| = 1 |Imax(A(X)i)| exp(c A(X)i,1) PN i =1 exp(c A(X)i,i ) i Imax(A(X)i) exp(c A(X)i,i ) |Imax(A(X)i)| exp(c A(X)i,1) = 1 N Ω(p +ζ) . Transformers, Parallel Computation, and Logarithmic Depth Likewise, for any i Imax(A(X)i): |softmax(c A(X))i,i hardmax(A(X))i,i | exp(c A(X)i,i ) PN i =1 exp(c A(X)i,i ) = 1 N Ω(p +ζ) . softmax(c A(X))i hardmax(c A(X))i 2 N max i |softmax(c A(X))i,i hardmax(c A(X))i,i | = 1 N Ω(p +ζ) . We conclude that the approximation is sufficiently close, meaning it is O(1/2p ), whereby it is exact after rounding: ˆf (X) hardmax(Q(X)K(X)T)V (X) f (X) hardmax(Q(X)K(X)T)V (X) + ˆf (X) f (X) max i,j softmax(c A(X))T i V (X) ,j hardmax(A(X))T i V (X) ,j + O 1 max i,j softmax(c A(X))T i hardmax(A(X))T i 2 V (X) ,j 2 + O 1 1 N Ω(p +ζ) N N ζ + O 1 Therefore, ˆf is a valid p -bit implementation of hardmax(Q(X)K(X)T)V (X). F.2 Constructions for Appendix B.1 Proposition B.1. For any b N and d, there exists a self-attention unit sparse Propagate Q,d Attn N m,p for m = d + O(Q log N) and p = O(log N), which, given any input X with Xi = (zi, Si, 0) Rd [N] Q {0}m Q d such that bi = |{Sj i : j [N]}| Q for all i, has output sparse Propagate Q,d(X) satisfying sparse Propagate Q,d(X)i = 1 Proof. Following the proof of Theorem 2 of Sanford et al. (2023), there exist p-bit precision vectors u1, . . . , u N { 1/ m}m and w S with w S 2 Q for all S N Q such that u T i w S = 1, for all i S u T i w S 1 2, for all i S. We then design the embeddings of sparse Propagate Q,d with Q(X)i = (ui, 1), ( (w Si, 0) if i > 0, ( 0, 3 4) if i = 0, ( zi if i > 0, 0 if i = 0. As a result, Q(X)T i K(X)i = 1 if i Si , i > 0, Q(X)T i K(X)i 1 2 if i Si , i > 0, Q(X)T i K(X)0 = 3 Transformers, Parallel Computation, and Logarithmic Depth Hence, the largest inner products for query i correspond to i for all Si i if any exist, and 0 otherwise. There exists a margin of at least 1 4 between the largest inner product in each row and all others. By applying Lemma A.2, we conclude that there exists a self attention unit f with embedding dimension p = Θ(log N) that computes f (X) = hardmax(Q(X)K(X)T)V (X) = sparse Propagate(X). F.3 Constructions for Appendix B.2 Lemma B.4. 
For any MPC protocol π with local memory s and q machines with nin-word inputs, there exists a transformer init Transformernin,max(nin,q) s,1,1,din,dout with din = 1 and dout = s, which, given Input Zn 2p, has output satisfying init(Input) = Machine In(1). Proof. Let M = max(nin, q) and Q, K, V : ZM 2p RM s be the query, key, and value embeddings of the attention unit f in init, and let ψ : RM s Zs 2p [N] be its output MLP. Let qin = nin s denote the number of machines used to store the inputs. Let Desti = l i s m [qin] denote the machine that stores the input token index i [nin] in the MPC protocol, and let Rcvdi = {(s 1)i + 1, . . . , min(si, nin)} denote the set of all input tokens indices belonging to Machine In(1) i for machine i [qin]. For each machine i [qin], we define the query embedding as Q(Input)i = cos 2πi , . . . , cos 2πi Likewise, for each token index i [nin], the key and value vectors are K(Input)i ,(2ι 1,2ι) = ( cos 2π Desti M , sin 2π Desti M if i nin, i ι (mod s), (0, 0) otherwise, V (Input)i ,(2ι 1,2ι) = ( (Inputi , i ) if i nin, i ι (mod s), (0, i ) otherwise. These definitions guarantee that large inner products only occur between machine queries Q(Input)i and tokens keys K(Input)i when Inputi is allocated to Machine In(1) i . That is, Q(Input)T i K(Input)i = 1, if i Rcvdi Q(Input)T i K(Input)i 1 Ω 1 , otherwise. By applying Lemma A.2 with ξ = Ω( 1 N 2 ), there exists some self-attention unit f such that f (Input)i = hardmax(Q(Input)K(Input)T) = (Inputi , i )i Rcvdi A proper choice of ψ and an invocation of the definition of Machine In(1) ensures that init(Input)i = ψ(f(Input))i = Machine In(1) i . Lemma B.6. For any R-round MPC protocol π with local memory s and q machines with nout-word output, there exists a transformer final Transformerq,max(nout,q) s,1,1,din,dout for din = s and dout = 1, which, given input X = Machine In(R), has output final(X) with final(X)i,1 = Outputi Z2p. Transformers, Parallel Computation, and Logarithmic Depth Proof. This argument inverts that of Lemma B.4, after applying the Local R to transform Machine In(R) to Machine Out(R). Let Q, K, V : ZM 2p RM s be the query, key, and value embeddings of the only attention unit f in final, and let ψ : RM s Zs 2p [N] be its output MLP. Let qout = nout s denote the number of machines storing relevant information for the output of the MPC protocol. For each machine i [qout], let Senti = {(s 1)i + 1, . . . , min(si , nout)} denote the set of all token indices receiving its output. Likewise, for each token index i [nout], let Srci = i/s be the machine containing its relevant token. We define Q = Q Local R, K = K Local R, V = V Local R as follows. Q (Machine Out(R))i,(2ι 1,2ι) = ( cos 2π Srci M , sin 2π Srci M if i nout, i ι (mod s) (0, 0) otherwise. K (Machine Out(R))i = cos 2πi , . . . , cos 2πi V (Machine Out(R))i = Msg Out(R) i . Applying Lemma A.2 as before yields f(Machine In(R))i = ( Machine Out(R) i if i Senti , 0 otherwise. A properly chosen ψ ensures that final(Machine In(R))i = ψ(f(Machine In(R)))i = Outputi. F.4 Constructions for Appendix D.1 Lemma D.1. For some m d + 2, τ : [N] Rm [N], and ρ : Rm Rd, there exists an attention head look Upτ,ρ Mask Attn N m with precision p = O(log N) and m d + 2 satisfying look Upτ,ρ(X)i,:d = ρ(Xτ(i,Xi)). Proof. We let V (Xi) = (ρ(Xi), 0) and define sinusoidal embeddings Q and K with Q(X)i = cos 2πτ(i, Xi) , sin 2πτ(i, Xi) K(X)i = cos 2πi Q(X)T i K(X)i = 1, if τ(i, Xi) = i , Q(X)T i K(X)i cos 2π , otherwise. 
F.4 Constructions for Appendix D.1

Lemma D.1. For any $m \geq d + 2$, $\tau : [N] \times \mathbb{R}^m \to [N]$, and $\rho : \mathbb{R}^m \to \mathbb{R}^d$, there exists an attention head $\mathsf{lookUp}_{\tau,\rho} \in \mathsf{MaskAttn}^N_m$ with precision $p = O(\log N)$ satisfying $\mathsf{lookUp}_{\tau,\rho}(X)_{i,:d} = \rho(X_{\tau(i, X_i)})$.

Proof. We let $V(X)_{i'} = (\rho(X_{i'}), 0)$ and define sinusoidal embeddings $Q$ and $K$ with
$$Q(X)_i = \Big(\cos\tfrac{2\pi\tau(i, X_i)}{N},\ \sin\tfrac{2\pi\tau(i, X_i)}{N}\Big), \qquad K(X)_{i'} = \Big(\cos\tfrac{2\pi i'}{N},\ \sin\tfrac{2\pi i'}{N}\Big),$$
so that
$$Q(X)_i^{\mathsf T}K(X)_{i'} = 1 \ \text{ if } \tau(i, X_i) = i', \qquad Q(X)_i^{\mathsf T}K(X)_{i'} \leq \cos\tfrac{2\pi}{N} \ \text{ otherwise.}$$
By applying Lemma A.2 with $\xi = \Omega(\tfrac{1}{N^2})$, we conclude that a satisfactory self-attention unit exists. □

Lemma D.2. For any finite alphabet $\Sigma$, $m \geq d + 2$, $\mu_1, \mu_2 : \mathbb{R}^m \to \Sigma$, and $\rho : \mathbb{R}^m \to \mathbb{R}^d$, there exists an attention head $\mathsf{lastOccurrence}_{\mu,\rho} \in \mathsf{MaskAttn}^N_m$ with precision $p = O(\log(N|\Sigma|))$ such that
$$\mathsf{lastOccurrence}(X)_{i,:d} = \begin{cases}\rho(0) & \text{if there is no } i' < i \text{ with } \mu_1(X_{i'}) = \mu_2(X_i),\\ \rho(X_{i'}) & \text{if } i' = \max\{i'' < i : \mu_1(X_{i''}) = \mu_2(X_i)\}.\end{cases}$$

Proof. Let $N' = N|\Sigma|$. We define token embeddings as follows, including start-token dummy embeddings as discussed in Appendix A.1:
$$Q(X)_i = \Big(\cos\tfrac{2\pi(N\mu_2(X_i) + i)}{N'},\ \sin\tfrac{2\pi(N\mu_2(X_i) + i)}{N'},\ 1\Big), \qquad K(X)_{i'} = \Big(\cos\tfrac{2\pi(N\mu_1(X_{i'}) + i')}{N'},\ \sin\tfrac{2\pi(N\mu_1(X_{i'}) + i')}{N'},\ 0\Big),$$
$$K(X)_0 = \Big(0,\ 0,\ \cos\tfrac{2\pi(N-1)}{N'}\Big), \qquad V(X)_{i'} = (\rho(X_{i'}), 0), \qquad V(X)_0 = 0.$$
Taken together, these embeddings provide the following characterization of the inner products (with causal masking matrix $\Gamma$):
$$Q(X)_i^{\mathsf T}K(X)_{i'} + \Gamma_{i,i'} = \cos\tfrac{2\pi(i - i')}{N'} \ \text{ if } i - i' > 0 \text{ and } \mu_1(X_{i'}) = \mu_2(X_i),$$
$$Q(X)_i^{\mathsf T}K(X)_{i'} + \Gamma_{i,i'} \leq \cos\tfrac{2\pi}{N'} \ \text{ if } i - i' > 0 \text{ and } \mu_1(X_{i'}) \neq \mu_2(X_i),$$
$$Q(X)_i^{\mathsf T}K(X)_{i'} + \Gamma_{i,i'} = -\infty \ \text{ if } i < i', \qquad Q(X)_i^{\mathsf T}K(X)_0 + \Gamma_{i,0} = \cos\tfrac{2\pi(N-1)}{N'}.$$
As a result, the largest inner product $Q(X)_i^{\mathsf T}K(X)_{i'}$ for query $i$ is attained at the largest $i' < i$ with $\mu_1(X_{i'}) = \mu_2(X_i)$ if one exists, and at $i' = 0$ otherwise. Furthermore, there exists a margin of $\Omega(\tfrac{1}{N^2|\Sigma|^2})$ between this inner product and all others. We conclude by applying Lemma A.2. □

G Further Empirical Analysis of k-Hop Induction Heads

This appendix presents in-depth explanations of the empirical results of Section 4.2, along with further experiments. Taken together, these results suggest that the relationship between the number of hops $k$ and the depth $L$ of transformers trained on the task is well characterized by the representational thresholds of Theorem 4.2 and Corollary 4.3; that the construction described in the proof of Theorem 4.2 is attainable by trained models; and that deep models likely exhibit an inductive bias favoring compositional learning rules in the finite-sample regime. We define our experimental methodology precisely in Appendix G.1 and provide supporting evidence for these claims in the subsequent sections.

Exponential Powers of Depth. Our principal empirical claim is that incrementing the depth $L$ of a transformer exponentially increases the model's capability to learn $k$-hop induction heads tasks. We explore this claim primarily in Appendix G.2, where we compare it with the relevant theoretical results (Theorem 4.2 and Corollary 4.3), which suggest a similar dependence. We further study the impact of increasing the embedding dimension $m$ of the transformer in Appendix G.3 and find that doubling the width is roughly equivalent in performance to incrementing the depth by one.

Empirical Claim G.1. A transformer $T \in \mathsf{MaskTransformer}^N_{m,L,H}$ trained with Adam to solve $\mathrm{hop}_k$ has small token-wise classification error if $L + \log m = \Omega(\log k)$ and large error if $L + \log m = O(\log k)$.

Mechanistic Alignment with Theoretical Construction. We further demonstrate the empirical salience of our theoretical construction by studying the interpretability of learned transformers in Appendix G.4. This investigation reveals that the attention matrices of sufficiently deep transformers implement a circuit that relies on the same doubling principle as the construction in the proof of Theorem 4.2. The resulting circuit is comprised of the same intermediate products that are used in that $\mathrm{hop}_k$ construction.
Empirical Claim G.2. The outputs of the individual attention matrices of a transformer $T \in \mathsf{MaskTransformer}^N_{m,L,H}$ trained with Adam to solve $\mathrm{hop}_k$ with $L = \Omega(\log k)$ and evaluated on input $X \in \Sigma^N$ (i) correspond to the $\mathrm{find}^j_X$ intermediate products of the Theorem 4.2 construction and (ii) demonstrate a doubling phenomenon in which each head in layer $\ell$ corresponds to $\mathrm{find}^j_X$ for some $j = O(2^\ell)$.

Beneficial Inductive Biases of Depth. While most of our experiments belong to the infinite-sample regime, where new samples are randomly generated on each training step, we also evaluate our models in two finite-sample regimes in Appendix G.5. We find that a small number of samples is sufficient to approach the performance of the infinite-sample regime. When the amount of training data is small, we find that deeper models perform better than shallower models, possibly due to an inductive bias that favors compositional hypotheses.

Empirical Claim G.3. $\mathrm{hop}_k$ can be learned in a sample-efficient manner by transformers $T \in \mathsf{MaskTransformer}^N_{m,L,H}$ trained with Adam with $L = \Omega(\log k)$. If $T$ overfits to $\mathrm{hop}_k$ tasks for some $k$, then increasing the depth $L$ while holding $k$ fixed leads to superior performance.

The experiments detailed here were conducted under limited computational resources. The authors are interested in future work that would evaluate whether these scaling rules persist for larger architectures and more complex tasks.

G.1 Experimental Details

Task Details. We study a multi-task variant of k-hop induction heads that predicts $\mathrm{hop}_k(X) = (0, \mathrm{hop}_k(X'))$ from input $X = (k, X')$ for $k \in \{0, 1, \dots, k_{\max}\}$ (the task $\mathrm{hop}_0$ is simply the identity mapping, $\mathrm{hop}_0(X') = X'$) and $X' \in \Sigma^{N-1}$. We refer to this task as multi-hop and provide the task hyper-parameters in Table 1.

Table 1. Multi-hop task hyper-parameters.
  Hyperparameter       Value
  Context length N     100
  Alphabet size |Σ|    4
  Max hops k_max       16

We define the distribution $\mathcal{D}_{\text{multi-hop}}$ over labeled samples for the multi-hop task and $\mathcal{D}_{X'}$ over input sequences $X' \in \Sigma^{N-1}$. We draw a labeled sample $(X, \mathrm{hop}_k(X)) \sim \mathcal{D}_{\text{multi-hop}}$ by independently sampling $k \sim \mathrm{Unif}(\{0, 1, \dots, k_{\max}\})$ and $X' \sim \mathcal{D}_{X'}$. Input sequences $X' \sim \mathcal{D}_{X'}$ are drawn uniformly from sequences in which no two adjacent symbols are equal. That is, we sample $X'_1 \sim \mathrm{Unif}(\Sigma)$ and each $X'_{j+1} \sim \mathrm{Unif}(\Sigma \setminus \{X'_j\})$. For each $k \in [k_{\max}]$, let $\mathcal{D}_{\mathrm{hop}_k}$ denote the conditional distribution of $((k', X'), (0, \mathrm{hop}_{k'}(X'))) \sim \mathcal{D}_{\text{multi-hop}}$ given $k' = k$. Also, let $\mathrm{dom}(\mathrm{hop}_k) = \{(k, X') : \Pr_{\mathcal{D}_{X'}}[X'] > 0\}$.

For $\bar\Sigma := \Sigma \cup \{0, 1, \dots, k_{\max}\}$, we define the n-sample empirical token-wise classification error of a transformer $T : \bar\Sigma^N \to \bar\Sigma^N$ on a task $\mathrm{hop}_k$ as
$$\mathrm{err}^n_k(T) = \frac{1}{n}\sum_{\iota=1}^{n} \frac{1}{|\{i : \mathrm{hop}_k(X_\iota)_i \neq \bot\}|} \sum_{i=1}^{N} \mathbb{1}\big\{T(X_\iota)_i \neq \mathrm{hop}_k(X_\iota)_i \ \text{and}\ \mathrm{hop}_k(X_\iota)_i \neq \bot\big\},$$
for iid samples $(X_1, \mathrm{hop}_k(X_1)), \dots, (X_n, \mathrm{hop}_k(X_n)) \sim \mathcal{D}_{\mathrm{hop}_k}$, where $\bot$ denotes the null output of $\mathrm{hop}_k$ at positions where no k-hop induction head exists. We ignore these null outputs in order to avoid inadvertently over-estimating the performance of transformers on large-$k$ tasks, which have a large fraction of null outputs.

Training Details. We trained a variety of causally masked GPT-2 transformers (Radford et al., 2019) from Hugging Face to solve the multi-hop task. The model has an absolute positional encoding. The transformers are trained with Adam (Kingma & Ba, 2014) on the cross-entropy loss. In the infinite-sample regime, we draw 32 new iid samples from $\mathcal{D}_{\text{multi-hop}}$ on each training step. Otherwise, $n_{\mathrm{train}}$ samples are drawn before training commences and all samples are rotated through batches before repeating. We use the hyper-parameters in Table 2 to train all of the models identified in Table 3.

Computational Resources. All experiments were run on a 2021 MacBook Pro with an M1 chip.
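The task distribution and error metric above translate directly into code. The sketch below is our own minimal implementation: the convention for the induction-head step $\mathrm{find}^1$ (jump to the index just after the most recent earlier occurrence of the current symbol) is assumed from Section 4, and the names `find1`, `hopk`, and `err` are ours.

```python
import random

SIGMA = ["a", "b", "c", "d"]      # alphabet, |Sigma| = 4
N = 100                           # context length (X' has N - 1 tokens)
NULL = None                       # null label where no k-hop induction head exists

def sample_X():
    """Draw X' ~ D_X': uniform over sequences with no two adjacent symbols equal."""
    seq = [random.choice(SIGMA)]
    for _ in range(N - 2):
        seq.append(random.choice([a for a in SIGMA if a != seq[-1]]))
    return seq

def find1(seq, i):
    """Assumed induction-head step: index just after the most recent earlier
    occurrence of seq[i]; None when seq[i] has not appeared before."""
    for j in range(i - 1, -1, -1):
        if seq[j] == seq[i]:
            return j + 1
    return None

def hopk(seq, k):
    """hop_k labels: iterate the induction-head step k times (hop_0 is the identity).
    A log-depth transformer instead composes these maps by doubling: find_{2j} = find_j o find_j."""
    labels = []
    for i in range(len(seq)):
        pos = i
        for _ in range(k):
            pos = find1(seq, pos) if pos is not None else None
        labels.append(seq[pos] if pos is not None else NULL)
    return labels

def err(preds_list, labels_list):
    """Empirical token-wise classification error err^n_k, ignoring null labels."""
    total = 0.0
    for preds, labels in zip(preds_list, labels_list):
        valid = [i for i, y in enumerate(labels) if y is not NULL]
        total += sum(preds[i] != labels[i] for i in valid) / max(len(valid), 1)
    return total / len(preds_list)

# Usage example: score a trivial "copy the previous token" baseline on hop_1.
xs = [sample_X() for _ in range(10)]
labels = [hopk(x, 1) for x in xs]
preds = [[x[i - 1] if i > 0 else None for i in range(len(x))] for x in xs]
print(err(preds, labels))
```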
Table 2. Model and training hyper-parameters.
  Hyperparameter             Value
  Embedding dimension m      {128, 256}
  Depth L                    {2, 3, 4, 5, 6}
  Number of heads H          {4, 8}
  Vocabulary size            30
  Activation function        GeLU
  Layer norm ε               10^-5
  Training samples n_train   {10^3, 3·10^3, ∞}
  Learning rate              10^-4
  Training steps             10^5
  Batch size                 32

Table 3. Hyper-parameters of all $\mathsf{MaskTransformer}^N_{m,L,H}$ models trained for the empirical analysis.
  Identifier       Heads H   Embedding dim m   Depth L   Training samples n_train   Total parameters
  T_{4,2}          4         128               2         ∞                          413,440
  T_{4,3}          4         128               3         ∞                          611,712
  T_{4,4}          4         128               4         ∞                          809,984
  T_{4,5}          4         128               5         ∞                          1,008,256
  T_{4,6}          4         128               6         ∞                          1,206,528
  T_{8,2}          8         256               2         ∞                          1,613,312
  T_{8,3}          8         256               3         ∞                          2,403,072
  T_{8,4}          8         256               4         ∞                          3,192,832
  T_{8,5}          8         256               5         ∞                          3,982,592
  T_{8,6}          8         256               6         ∞                          4,772,352
  T^{3000}_{4,2}   4         128               2         3000                       413,440
  T^{3000}_{4,3}   4         128               3         3000                       611,712
  T^{3000}_{4,4}   4         128               4         3000                       809,984
  T^{3000}_{4,5}   4         128               5         3000                       1,008,256
  T^{3000}_{4,6}   4         128               6         3000                       1,206,528
  T^{1000}_{4,2}   4         128               2         1000                       413,440
  T^{1000}_{4,3}   4         128               3         1000                       611,712
  T^{1000}_{4,4}   4         128               4         1000                       809,984
  T^{1000}_{4,5}   4         128               5         1000                       1,008,256
  T^{1000}_{4,6}   4         128               6         1000                       1,206,528

G.2 Exponential Increases in k-Hop Capacity with Depth (Empirical Claim G.1; Figures 6 to 8)

We visualize the relationship between the depth $L$ of a transformer and the largest $k$ such that $\mathrm{err}^n_k(T)$ is small in Figures 6, 7, and 8. We exhibit the relationship in its simplest form by considering transformers with $H = 4$ heads, embedding dimension $m = 128$, and new training samples on every epoch. The figures provide alternate views of $\mathrm{err}^n_k(T_{4,L})$ for each $L \in \{2, 3, 4, 5, 6\}$ with $n = 100$ samples for each $k \in [k_{\max}]$. Together, these plots illustrate a sharp phase transition at $L = \log_2 k + 2$, which exactly matches the depth scaling in Theorem 4.2: increasing the depth of a transformer by one approximately doubles the number of values $k \in [k_{\max}]$ with bounded error. For instance, following the theoretical and empirical intuition of Bietti et al. (2023), the depth $L = 2$ transformer $T_{4,2}$ succeeds in solving the standard induction heads task but attains at least 10% error on all other tasks. Likewise, a depth $L = 3$ model has error bounded by 1% for $k \in \{1, 2\}$, which increases rapidly for larger values of $k$. This doubling phenomenon suggests that simple compositional tasks with a larger number of compositions than the depth of the model are easily learnable if the model can employ a doubling trick similar to the one used in the proof of Theorem 4.2.

[Figure 6: line plot of $\mathrm{err}_k$ versus $k \in \{1, 2, 4, 8, 16\}$ with one curve per depth $L \in \{2,\dots,6\}$; title: "Evaluation of L-depth, 4-headed, infinite-sample transformers on hop_k".]
Figure 6. Zoomed-in version of Figure 2: evaluation of $\mathrm{err}^n_k(T_{4,L})$ for transformers with depths $L \in \{2, 3, 4, 5, 6\}$, $H = 4$ heads, and embedding dimension $m = 128$ trained on the multi-hop task. The figure plots $\mathrm{err}^n_k(T_{4,L})$ on $n = 100$ samples as a function of $k$ for each choice of $L$.

This relationship between compositionality and depth reflects the results of Zhang et al. (2023), where the learnable task complexity also scales super-linearly in depth.
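For reference, the phase transition described above can be tabulated directly. The snippet below is our own back-of-the-envelope reading of the $L = \log_2 k + 2$ threshold (roughly: one layer for the initial look-up, $\lceil\log_2 k\rceil$ doubling layers, and one read-out layer); it is an illustration of the claimed scaling, not a result from the paper's experiments.

```python
import math

# Smallest depth predicted to suffice for hop_k under the observed L = log2(k) + 2 threshold.
for k in range(1, 17):
    print(f"k = {k:2d}  ->  L = {math.ceil(math.log2(k)) + 2}")
```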
Given the lower bounds of Corollary 4.3, one may ask why models with depth $L < \log_2 k$ achieve non-trivial success on $\mathrm{hop}_k$ tasks that cannot be represented in a compositional manner. There are several relevant explanations:

1. In these experiments, the embedding dimension $m = 128$ is actually larger than the context length $N = 100$, which may enable the model to memorize more of its preceding samples and offload logical work to the MLP rather than executing a pointer-doubling strategy. While practical models regularly have the opposite relationship (and our theoretical results are oriented around that parametric scaling), we used a larger $m$ than is necessary for representational purposes in order to improve the optimization landscape and speed convergence.

2. This is made further plausible by the small alphabet size $|\Sigma|$ and the randomly drawn sequences $X'$, which place effective bounds on how much look-back from each token $i$ is necessary to compute $\mathrm{hop}_k(X)_i$.

Nonetheless, these results provide strong support that models are substantially easier to train to low classification error in the regime where the depth is sufficient to implement a pointer-doubling construction. In the following subsection, we further investigate this phenomenon by examining the intermediate attention matrices produced by trained models.

[Figure 7: line plot of $\mathrm{err}_k$ versus depth $L \in \{2,\dots,6\}$ with one curve per $k \in [16]$.]
Figure 7. Alternate view of Figure 6, plotting $\mathrm{err}^n_k(T_{4,L})$ as a function of $L$ for each $k$.

Figure 8. Alternate view of Figure 6, presenting $\mathrm{err}^n_k(T_{4,L})$ as a table with one cell for each $(L, k)$ pair:
  L \ k   1      2      3      4      5      6      7      8      9      10     11     12     13     14     15     16
  2       0.01   0.14   0.35   0.42   0.54   0.59   0.62   0.66   0.69   0.69   0.71   0.72   0.73   0.73   0.74   0.74
  3       0.003  0.008  0.048  0.15   0.28   0.36   0.42   0.46   0.52   0.57   0.6    0.64   0.67   0.67   0.69   0.71
  4       0.001  0.003  0.005  0.015  0.034  0.062  0.11   0.22   0.28   0.36   0.4    0.47   0.53   0.53   0.58   0.61
  5       0.001  0.001  0.001  0.003  0.012  0.015  0.017  0.037  0.058  0.1    0.15   0.22   0.28   0.34   0.38   0.41
  6       0      0      0.001  0.002  0.003  0.003  0.007  0.006  0.007  0.01   0.011  0.012  0.021  0.027  0.035  0.062

G.3 Width Variation (Empirical Claim G.1; Figure 9)

While the primary focus of these empirical results, and of the paper as a whole, is on the role of depth in the ability of transformers to learn parallelizable and compositional tasks, we also aim to understand the interplay of depth and width in learning the multi-hop task. Here, we contrast the previous transformers $T_{4,L}$ with models $T_{8,L}$ that have more heads ($H = 8$) and larger embedding dimensions ($m = 256$). We plot the classification errors of all 10 architectures over the 16 $\mathrm{hop}_k$ sub-tasks in Figure 9. We observe a rough correspondence in performance between the transformers $T_{H,L}$ and $T_{2H,L-1}$, and the same doubling phenomenon as is evident in the models with $H = 4$ heads. That is, while increasing the width improves the classification error of learned models, it does so in a far less parameter-efficient manner than incrementing the depth. As mentioned before, the relative success of wide and shallow transformers is likely contingent on the relatively short context length $N$ and small alphabet size $|\Sigma|$. However, these results still suggest that width plays an important role beyond the representational capabilities of transformers.

[Figure 9: line plot of $\mathrm{err}_k$ versus $k \in \{1, 2, 4, 8, 16\}$ with one dashed and one solid curve per depth $L \in \{2,\dots,6\}$; title: "Evaluation of L-depth, H-headed, infinite-sample transformers on hop_k".]
Figure 9. Comparison between the errors $\mathrm{err}^n_k(T_{H,L})$ of transformers with heads and embedding dimension $(H, m) = (4, 128)$ (dashed lines, same curves as Figure 6) and $(H, m) = (8, 256)$ (solid lines) trained on the multi-hop task, evaluated on $n = 100$ samples per $\mathrm{hop}_k$ task.
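The parameter totals reported in Table 3 are consistent with the standard GPT-2 parameterization. The sketch below assumes the Hugging Face GPT-2 layout (per-block attention and MLP weights plus two layer norms, token and positional embeddings, a final layer norm, and a tied output head) with a vocabulary of 30 and 100 positions; these assumptions are ours, but the resulting counts reproduce the table, which is one way to quantify the depth-versus-width parameter efficiency discussed above.

```python
def gpt2_params(m, L, vocab=30, n_positions=100):
    """Approximate parameter count of a small GPT-2: each block contributes
    12*m^2 + 13*m (attention + MLP + two layer norms), plus embeddings and a final layer norm."""
    per_block = 12 * m * m + 13 * m
    return L * per_block + vocab * m + n_positions * m + 2 * m

print(gpt2_params(128, 2))   # 413,440   -- matches T_{4,2} in Table 3
print(gpt2_params(128, 6))   # 1,206,528 -- matches T_{4,6}
print(gpt2_params(256, 2))   # 1,613,312 -- matches T_{8,2}
```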
G.4 Mechanistic Alignment with Theoretical Construction (Empirical Claim G.2; Figures 10 to 15)

We use standard attention-based interpretability techniques to better understand which logical circuits are implemented by transformers trained to solve the multi-hop task. By qualitatively inspecting the attention matrices produced by trained models, and by measuring the alignment between those inner products and the partial solutions $\mathrm{find}^j$ of $\mathrm{hop}_k$, we uncover a striking correspondence between the behaviors of the trained models and the transformer construction designed in the proof of Theorem 4.2. We further observe that trained transformers with high accuracy have decisive self-attention units with particularly strong correlations to some $\mathrm{find}^j$ intermediate, while poorly performing models have less predictable attention activations.

For a fixed trained model $T \in \mathsf{Transformer}^N_{m,L,H}$, we let $A_{\ell,h}[T](X)$ represent the output of the $h$-th self-attention matrix in the $\ell$-th layer for $h \in [H]$ and $\ell \in [L]$, evaluated at some input $X \in \mathrm{dom}(\mathrm{hop}_k)$. That is, we let
$$A_{\ell,h}[T](X) = \mathrm{softmax}\big(Q_{\ell,h}(X^{\ell-1})K_{\ell,h}(X^{\ell-1})^{\mathsf T} + \Gamma\big) \in \mathbb{R}^{N\times N},$$
where $X^{\ell-1}$ is the intermediate state representing the output of layer $\ell-1$ of $T$ on input $X$ and $\Gamma$ is the causal masking matrix. Each row $i$ of the matrix contains the coefficients of the convex combination of value vectors affiliated with query $i$, which indicates which embeddings token $i$ receives information from.

Visualization of $\mathrm{find}^j$ Alignment for $\mathrm{hop}_{16}$ and Depth $L = 6$ (Figure 10). The outputs of self-attention matrices are often highly structured and reveal which relationships between tokens are encoded and how information is shared within the model (Li & McClelland, 2022; Clark et al., 2019; Rogers et al., 2021). We plot several self-attention matrices associated with a depth $L = 6$, $H = 4$-head transformer trained in the infinite-sample regime and evaluated on a single sample $X \in \mathrm{dom}(\mathrm{hop}_{16})$ in Figure 10. From the six self-attention matrices, one can infer that all heads are decisive and obtain nearly all of their relevant information from a single value embedding, rather than from averages of a large number of embeddings. The top-left self-attention matrix, which belongs to the first self-attention head, clearly associates elements with their predecessors, which is identical to the function of our lookUp attention head in the first layer of the $\mathrm{hop}_k$ construction of Theorem 4.2. While the roles of the other heads are not immediately obvious, they can be understood by overlaying colored matrices with non-zero cells at $(i, \mathrm{find}^j_X(i))$ for some $j \leq k$. For instance, the top-right attention matrix in layer $\ell = 2$ corresponds almost exactly with $\mathrm{find}^1_X$ (as suggested by the second layer of our construction), and the others are closely associated with $\mathrm{find}^1_X$, $\mathrm{find}^2_X$, $\mathrm{find}^3_X$, and $\mathrm{find}^8_X$ for layers $\ell = 3, 4, 5, 6$, respectively. This is a remarkably close correspondence to our construction, which includes a self-attention matrix in the $\ell$-th layer whose activations correspond to $\mathrm{find}^{2^{\ell-2}}_X$. While not conclusive, this experiment suggests a strong alignment between the behaviors of this particular transformer and our theoretical construction. This, in turn, suggests a high likelihood that the transformer successfully learns to solve $\mathrm{hop}_{16}$ by employing a pointer-doubling primitive. However, these results apply to only a single model, a single task, and a single input; in the subsequent section, we generalize this interpretability analysis.
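The matrices $A_{\ell,h}[T](X)$ can be read directly out of a Hugging Face GPT-2 model by requesting attention outputs. The sketch below uses the public `gpt2` checkpoint and random token ids purely as stand-ins for the paper's trained multi-hop models and inputs.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")   # stand-in for the paper's small trained models
model.eval()

input_ids = torch.randint(0, model.config.vocab_size, (1, 100))  # dummy length-100 sequence
with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# out.attentions is a tuple with one tensor per layer, shaped (batch, heads, N, N);
# entry [l][0, h] is the causally masked softmax(Q K^T + Gamma) for layer l+1, head h+1.
A_2_4 = out.attentions[1][0, 3]                   # A_{l=2, h=4}, zero-indexed as [1][0, 3]
print(len(out.attentions), out.attentions[0].shape, A_2_4.sum(dim=-1)[:3])  # each row sums to 1
```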
Alignment between Attention Heads and $\mathrm{find}^j$ for a Single $\mathrm{hop}_k$ Sub-Task (Figures 11 to 13). To broaden and quantify the analysis of the previous section, we measure the extent to which each self-attention head mimics the functionality of $\mathrm{find}^j$, the partial computations of $\mathrm{hop}_k$ that are employed in the proof of Theorem 4.2. We use cell-wise matrix inner products to quantify the strength of correlation between a self-attention matrix and a fixed function of interest. For two matrices $A, B \in \mathbb{R}^{N\times N}$, let
$$\langle A, B\rangle = \frac{\sum_{i,i'}(A \odot B)_{i,i'}}{\|A\|_F\,\|B\|_F}$$
be their normalized element-wise inner product, where $\|\cdot\|_F$ is the Frobenius norm and $\odot$ denotes element-wise multiplication. For some function $g : [N] \to \{0\} \cup [N]$, we let $\langle g, B\rangle := \langle A^g, B\rangle$, where
$$A^g_{i,i'} = \begin{cases}1 & \text{if } g(i) = i',\\ 0 & \text{otherwise.}\end{cases}$$

[Figure 10: six grayscale self-attention matrices $A_{\ell,h}[T_{4,6}](X)$; panels: $(\ell=1, h=1)$ with $\mathrm{find}^0_X$ highlighted, $(\ell=2, h=4)$ with $\mathrm{find}^1_X$, $(\ell=3, h=2)$ with $\mathrm{find}^1_X$, $(\ell=4, h=2)$ with $\mathrm{find}^2_X$, $(\ell=5, h=3)$ with $\mathrm{find}^3_X$, and $(\ell=6, h=1)$ with $\mathrm{find}^8_X$.]
Figure 10. The outputs of several internal self-attention matrices $A_{\ell,h}[T_{4,6}](X) \in \mathbb{R}^{100\times 100}$ of a trained multi-task transformer of depth $L = 6$ evaluated on a single sample $X \sim \mathcal{D}_{\mathrm{hop}_{16}}$ are plotted in grayscale. In each panel, the matrix with non-zero entries at $(i, \mathrm{find}^j_X(i))_{i\in[N]}$ for some $j$ is overlaid in transparent color to visualize the function of each self-attention unit.

We use this notation to analyze experimentally how closely the self-attention matrices $A_{\ell,h}$ encode the intermediate products $\mathrm{find}^j_X$ of the proof of Theorem 4.2. For $n$ iid samples $X_1, \dots, X_n \sim \mathcal{D}_{\mathrm{hop}_k}$, let
$$\langle A_{\ell,h}, \mathrm{find}^j\rangle_{n,k} := \frac{1}{n}\sum_{\iota=1}^{n}\big\langle \mathrm{find}^j_{X_\iota},\, A_{\ell,h}(X_\iota)\big\rangle.$$
Due to the non-negativity of $A_{\ell,h}$ and $\mathrm{find}^j$, we have $\langle A_{\ell,h}, \mathrm{find}^j\rangle_{n,k} \in [0, 1]$, and $\langle A_{\ell,h}, \mathrm{find}^j\rangle_{n,k} = 1$ only if
$$\forall \iota \in [n]:\quad A_{\ell,h}(X_\iota)_{i,i'} = 1 \iff \mathrm{find}^j_{X_\iota}(i) = i'.$$
These inner products make it possible to visualize the strength of correlation of all heads in a particular model $T \in \mathsf{MaskTransformer}^N_{m,L,H}$ with all target functions $\mathrm{find}^j$ on a collection of random samples drawn from some $\mathcal{D}_{\mathrm{hop}_k}$.
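A minimal implementation of this alignment score is sketched below; the indicator places a 1 at cell $(i, g(i))$, following the convention of the displayed equivalence above, and the helper names are ours.

```python
import numpy as np

def indicator(g, N):
    """0/1 matrix A^g with a 1 in cell (i, g(i)) for every position i where g is defined."""
    Ag = np.zeros((N, N))
    for i, gi in enumerate(g):
        if gi is not None:
            Ag[i, gi] = 1.0
    return Ag

def alignment(A, g):
    """Normalized element-wise inner product <g, A> in [0, 1] for a row-stochastic attention
    matrix A; by Cauchy-Schwarz it equals 1 precisely when A coincides with A^g."""
    Ag = indicator(g, A.shape[0])
    denom = np.linalg.norm(A) * np.linalg.norm(Ag)
    return float((A * Ag).sum() / denom) if denom > 0 else 0.0

# Averaging alignment(A_{l,h}[T](X_t), find^j restricted to X_t) over n samples X_t ~ D_{hop_k}
# reproduces the scores <A_{l,h}, find^j>_{n,k} plotted in Figures 11-15.
```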
Figure 11 visualizes the functionality of all attention units in the 4-layer, 4-head transformer $T_{4,4}$ when evaluated on the sub-task $\mathrm{hop}_4$. The figure gives several clues about how $\mathrm{hop}_4$ is successfully computed by the trained model: the second and third layers both utilize $\mathrm{find}^1$ to jointly determine $\mathrm{find}^2$ by the end of the third layer, and the fourth layer uses this stable $\mathrm{find}^2$ representation to obtain $\mathrm{find}^4$ and hence $\mathrm{hop}_4$. This plot also indicates the relative stability of this circuit interpretation: a large number of heads are very strongly correlated with $\mathrm{find}^1$ or $\mathrm{find}^2$ across the 10 samples, which indicates that they are likely used consistently to compute those intermediates regardless of the input. Figure 12 is a similar plot for the transformer $T_{4,6}$ with depth $L = 6$, evaluated on the task $\mathrm{hop}_{16}$. The functionalities of the heads visualized in Figure 10 can be observed in the corresponding inner products. The collection of all inner products presents further evidence that the pointer-doubling phenomenon occurs in the trained models, owing to the increasing number of compositions present in the largest inner products of deeper attention units.

While Figures 11 and 12 showcase the decisive alignment between self-attention heads and particular partial computations $\mathrm{find}^j$ in successfully trained models, Figure 13 demonstrates the loss of that decisiveness in poorly performing transformers. There, we visualize the alignments of the trained depth-4 transformer $T_{4,4}$ evaluated on $\mathrm{hop}_{16}$, on which it attains 61% token error. While a self-attention unit in the second layer coincides with $\mathrm{find}^1$, no strong correlations emerge deeper in the model. Unlike in the other figures, the deeper self-attention units are indecisive, lacking any large inner products and failing in particular to correlate with any highly compositional targets. This provides a visual explanation of the transformer's failure: it lacked the effective representational capacity needed to learn a circuit with consistent and highly compositional outputs.9

Alignment between Attention Heads and $\mathrm{find}^j$ for all $\mathrm{hop}_k$ Sub-Tasks (Figures 14 and 15). For an even more global lens on the mechanistic interpretability of these trained models, we visualize how the maximum inner products of each self-attention unit change, for a fixed transformer, across the different sub-tasks $\mathrm{hop}_k$. Figures 14 and 15 do so for the depth-4 and depth-6 networks, respectively. The hue of each cell (and its numerical label) corresponds to the $j$ whose inner product with the corresponding attention unit $A_{\ell,h}$ is largest on samples from $\mathrm{dom}(\mathrm{hop}_k)$, and the opacity corresponds to the magnitude of that inner product. The takeaways of the previous inner-product figures are apparent here as well: the approximate doubling for the depth $L = 6$ transformer can be seen in the vertically changing opaque colors. Conversely, a separation can be observed between the tasks where the depth $L = 4$ transformer performs well and has decisive self-attention units deeper in the network and those where it does not. Moreover, the figures (especially Figure 15) demonstrate that several self-attention units have a consistent function among samples from the same task, while adapting their function to different $\mathrm{hop}_k$ tasks. This is most apparent in head $h = 4$ of layer $\ell = 6$, where the self-attention head functions as $\mathrm{find}^1$, $\mathrm{find}^3$, $\mathrm{find}^5$, or $\mathrm{find}^7$ depending on the complexity of the task.

9 Since these experiments are in the small alphabet size ($|\Sigma| = 4$) regime, this task performs better than random guessing due to inferential capabilities that are powered by the high embedding dimension and do not require implementing a pointer-chasing algorithm. We suspect that the checkerboard patterns are powered by this inference.

[Figure 11: heatmap of $\langle A_{\ell,h}, \mathrm{find}^j\rangle_{10,4}$ over layer/head pairs (rows) and $j \in [16]$ (columns); values omitted.]
Figure 11. Plots of all inner products $\langle A_{\ell,h}[T_{4,4}], \mathrm{find}^j\rangle_{10,4}$ for $n = 10$ samples $X_1, \dots, X_{10} \in \mathrm{dom}(\mathrm{hop}_4)$ for the 4-layer transformer $T_{4,4}$.
[Figure 12: heatmap of $\langle A_{\ell,h}, \mathrm{find}^j\rangle_{10,16}$ over layer/head pairs (rows) and $j \in [16]$ (columns); values omitted.]
Figure 12. Plots of all inner products $\langle A_{\ell,h}[T_{4,6}], \mathrm{find}^j\rangle_{10,16}$ for $n = 10$ samples $X_1, \dots, X_{10} \in \mathrm{dom}(\mathrm{hop}_{16})$ for the 6-layer transformer $T_{4,6}$.

[Figure 13: heatmap of $\langle A_{\ell,h}, \mathrm{find}^j\rangle_{10,16}$ over layer/head pairs (rows) and $j \in [16]$ (columns); values omitted.]
Figure 13. Plots of all inner products $\langle A_{\ell,h}[T_{4,4}], \mathrm{find}^j\rangle_{10,16}$ for $n = 10$ samples $X_1, \dots, X_{10} \in \mathrm{dom}(\mathrm{hop}_{16})$ for the 4-layer transformer $T_{4,4}$.

[Figure 14: per-$(\ell, h)$ map of $\arg\max_j \langle A_{\ell,h}, \mathrm{find}^j\rangle_{n,k}$ across tasks $k \in [16]$ for the depth-4 transformer; values omitted.]
Figure 14. Plots of the maximum inner products $\langle A_{\ell,h}[T_{4,4}], \mathrm{find}^j\rangle_{n,k}$ for $n = 10$ fixed samples $X_1, \dots, X_{10} \in \mathrm{dom}(\mathrm{hop}_k)$ for each $k \in [16]$ for the 4-layer transformer $T_{4,4}$. The hue corresponds to the index of the largest inner product, $j^* = \arg\max_j \langle A_{\ell,h}[T_{4,4}], \mathrm{find}^j\rangle_{n,k}$, while the opacity is determined by the magnitude of the correlation.
[Figure 15: per-$(\ell, h)$ map of $\arg\max_j \langle A_{\ell,h}, \mathrm{find}^j\rangle_{n,k}$ across tasks $k \in [16]$ for the depth-6 transformer; values omitted.]
Figure 15. Plots of the maximum inner products $\langle A_{\ell,h}[T_{4,6}], \mathrm{find}^j\rangle_{n,k}$ for $n = 10$ fixed samples $X_1, \dots, X_{10} \in \mathrm{dom}(\mathrm{hop}_k)$ for each $k \in [16]$ for the 6-layer transformer $T_{4,6}$.

G.5 Finite-Sample Experiments (Empirical Claim G.3; Figures 16 to 19)

While most of our multi-hop experiments reside in the infinite-sample regime (where new samples are generated for every batch), we also trained several transformers on $n_{\mathrm{train}} \in \{1000, 3000\}$ samples to evaluate whether generalization is possible in this domain, especially when the number of model parameters far exceeds the number of training samples. The two training-set sizes expose a sharp threshold between two different generalization modes: low accuracy due to overfitting for most models on most tasks when $n_{\mathrm{train}} = 1000$, and high accuracy approaching the infinite-sample regime when $n_{\mathrm{train}} = 3000$.

Figure 16 compares the infinite-sample transformers $T_{4,L}$ with the 3000-sample models $T^{3000}_{4,L}$: 3000 training samples are sufficient to obtain comparable (if slightly worse) generalization error rates across model depths $L$ and task complexities $k$. This supports the hypothesis that the existence of a small transformer that perfectly fits the data makes it possible for larger, over-parameterized transformers to actually realize such a solution. On the other hand, Figure 17 demonstrates that transformers trained on $n_{\mathrm{train}} = 1000$ samples suffer poor performance on most tasks due to overfitting. While all models perform poorly on $\mathrm{hop}_k$ sub-tasks for large $k$, a depth separation exists for simpler sub-tasks like $\mathrm{hop}_3$. This suggests a positive inductive bias of deep transformers toward simple compositional decision rules, which enables far better performance than other models in the overfitting regime.

To investigate this gap in performance, we contrast the self-attention inner products of the depth-4 model $T^{1000}_{4,4}$ and the depth-6 model $T^{1000}_{4,6}$ on the task $\mathrm{hop}_3$ in Figures 18 and 19. The 6-layer model obtains a far superior classification error on the sub-task, and the interpretability plot establishes a plausible circuit that it implements: it uses self-attention heads with $\mathrm{find}^1$ functionality consecutively in layers 4, 5, and 6, which enables the robust retrieval of $\mathrm{find}^3$ and $\mathrm{hop}_3$. On the other hand, the 4-layer model exhibits poor performance and has only two layers with $\mathrm{find}^1$ functionality; this explains the relatively strong performance of $T^{1000}_{4,4}$ on $\mathrm{hop}_2$ and its poor performance on $\mathrm{hop}_3$.
While neither model learns any kind of pointer-doubling construction, the 6-layer model is still able to learn a simple construction of $\mathrm{hop}_3$ that the 4-layer model misses. The representational suitability of deeper models for compositional reasoning may thus provide a favorable inductive bias for learning the task in a setting with little data.

[Figure 16: line plot of $\mathrm{err}_k$ versus $k \in \{1, 2, 4, 8, 16\}$ with dashed (infinite-sample) and solid ($n_{\mathrm{train}} = 3000$) curves per depth $L$; title: "Evaluation of L-depth, 4-headed, 3000-sample transformers on hop_k".]
Figure 16. Comparison between the errors $\mathrm{err}^n_k(T^{n_{\mathrm{train}}}_{4,L})$ of transformers trained in the infinite-sample regime (dashed lines) and on $n_{\mathrm{train}} = 3000$ samples (solid lines) on the multi-hop task, evaluated on $n = 100$ samples per $\mathrm{hop}_k$ task.

[Figure 17: line plot as in Figure 16, with $n_{\mathrm{train}} = 1000$; title: "Evaluation of L-depth, 4-headed, 1000-sample transformers on hop_k".]
Figure 17. Comparison between the errors $\mathrm{err}^n_k(T^{n_{\mathrm{train}}}_{4,L})$ of transformers trained in the infinite-sample regime (dashed lines) and on $n_{\mathrm{train}} = 1000$ samples (solid lines) on the multi-hop task, evaluated on $n = 100$ samples per $\mathrm{hop}_k$ task.

[Figure 18: heatmap of $\langle A_{\ell,h}, \mathrm{find}^j\rangle_{10,3}$ over layer/head pairs (rows) and $j \in [16]$ (columns); values omitted.]
Figure 18. Plots of all inner products $\langle A_{\ell,h}[T^{1000}_{4,4}], \mathrm{find}^j\rangle_{10,3}$ for $n = 10$ samples $X_1, \dots, X_{10} \in \mathrm{dom}(\mathrm{hop}_3)$ for the 4-layer transformer $T^{1000}_{4,4}$, trained on $n_{\mathrm{train}} = 1000$ samples.
[Figure 19: heatmap of $\langle A_{\ell,h}, \mathrm{find}^j\rangle_{10,3}$ over layer/head pairs (rows) and $j \in [16]$ (columns); values omitted.]
Figure 19. Plots of all inner products $\langle A_{\ell,h}[T^{1000}_{4,6}], \mathrm{find}^j\rangle_{10,3}$ for $n = 10$ samples $X_1, \dots, X_{10} \in \mathrm{dom}(\mathrm{hop}_3)$ for the 6-layer transformer $T^{1000}_{4,6}$, trained on $n_{\mathrm{train}} = 1000$ samples.