Do Efficient Transformers Really Save Computation?

Kai Yang 1, Jan Ackermann 2, Zhenyu He 3, Guhao Feng 1, Bohang Zhang 3, Yunzhen Feng 4, Qiwei Ye 5, Di He 3, Liwei Wang 3 6

1 School of EECS, Peking University. 2 ETH Zürich. 3 National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University. 4 New York University. 5 Beijing Academy of Artificial Intelligence. 6 Center for Machine Learning Research, Peking University. Correspondence to: Di He, Liwei Wang, Bohang Zhang. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and what directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works to model them as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses.

1. Introduction

The Transformer architecture, as introduced in the seminal work of Vaswani et al. (2017), has demonstrated remarkable performance in numerous applications ranging from natural language processing to computer vision and speech. A significant advancement has recently been made by scaling up Transformers to build Large Language Models (LLMs) (Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023). These LLMs, exemplified by models like GPT and LLaMA, typically have billions of parameters and are trained on datasets containing trillions of tokens. Given the substantial computational demands, enhancing LLMs' efficiency has become a pivotal research focus in academic and industrial contexts. The primary computational bottleneck in Transformers arises from the self-attention module, whose complexity scales quadratically with the sequence length. The cost becomes particularly noticeable in tasks that require long sequence generation, such as coherent story generation or reasoning with Chain-of-Thought prompts (Wei et al., 2022b; Kojima et al., 2022; Nye et al., 2022; Zhou et al., 2023).
Given the practical needs, a large body of work seeks to develop efficient Transformers that can reduce the quadratic complexity of self-attention (Tay et al., 2022), typically by imposing sparsity in the architectural design (Child et al., 2019; Beltagy et al., 2020; Qiu et al., 2020; Kitaev et al., 2020; Vyas et al., 2020; Roy et al., 2021) or by employing low-rank or kernel-based approximations to accelerate the computation (Katharopoulos et al., 2020; Choromanski et al., 2021; Peng et al., 2021; Wang et al., 2020; Luo et al., 2021). However, there is generally a lack of understanding about the capabilities of efficient Transformers.

In this work, we take a step towards theoretically understanding the capability of efficient Transformers. In particular, we focus on the models' reasoning ability, a fundamental aspect of human intelligence that plays a vital role in problem-solving, decision-making, and planning. Inspired by a recent study in Feng et al. (2023), we model reasoning as a dynamic programming (DP) process, as it closely resembles the way Chain-of-Thought prompts are executed. The output sequence consists of answers to a series of intermediate steps, each corresponding to solving a subproblem represented by a DP state. Feng et al. (2023) proved that all reasoning problems fitting within this framework can be solved by a standard autoregressive Transformer of a constant size (independent of the problem scale), thus achieving a computational complexity of Θ(L²), where L is the length of the output sequence.

Table 1. Complexity of the Transformer variants on different tasks.

Architecture            General Reasoning    Arithmetic    Reasoning (locality assumption)
Standard Transformer    Θ(L²)                Θ(L²)         Θ(L²)
Sparse Transformer      Θ(L²)                Θ(L√L)        Θ(L√L) if m = O(√L)
Linear Transformer      Θ(L²)                Ω(L√L)        Ω(mL)

In our work, we focus on two representative (Tay et al., 2022) and successful (Tay et al., 2020b; Brown et al., 2020) variants of efficient Transformers: the Sparse Transformer (Child et al., 2019) and the Linear Transformer (Katharopoulos et al., 2020). (In the following, we will refer to these two as efficient Transformers.) Our analysis shows that both architectures possess the necessary expressiveness for all problems within this DP framework, despite their attention cost scaling only as Θ(L√L) and Θ(L), respectively. Although this positive result might lead one to believe that we can supplant the standard Transformer with others of lower complexity, the situation is more complicated: our main result highlights that both the Sparse Transformer and the Linear Transformer require a model size that grows with the problem scale L, in contrast to the constant size of standard Transformers. Specifically, under mild assumptions, we prove that neither architecture can generate the DP solution unless the hidden dimension of the network layers scales as Ω(√L). This scaling results in a total computational complexity of Ω(L²), matching the vanilla Transformer's complexity. But this result introduces a paradox: when tackling general DP problems, the touted efficiency of these efficient Transformers appears to dissolve, rendering them no more efficient than standard Transformers.

The above findings about general DP problems raise the question: For which problems are efficient Transformers efficient? To answer this question, we start by studying a fundamental task for reasoning: evaluating arithmetic expressions (Feng et al., 2023).
Notably, we find that the complexity lower bound can be improved to Ω(L√L) for both architectures, and that this bound can be attained by the Sparse Transformer with a constant hidden dimension. Motivated by this finding, we then identify a general condition that unlocks the efficiency of efficient Transformers, called the locality assumption. Intuitively, this assumption states that each step in the reasoning process only depends on the outcomes of the most recent m reasoning steps, where m is far smaller than L, i.e., m = o(L). Under this assumption, we show that the complexity lower bound can be improved for both the Sparse and the Linear Transformer. We summarize our theoretical results in Table 1.

We complement our theoretical findings with an extensive set of experiments. Following Feng et al. (2023), we focus on the Arithmetic task and two additional DP problems: the Longest Increasing Subsequence (LIS) and the Edit Distance (ED). Notably, the ED task satisfies the locality assumption, whereas the LIS task does not. For each task, we systematically investigate how variations in the problem size (i.e., the sequence length L) and the hidden dimension of the Transformer models impact the models' performance. Empirical evidence confirms that, for both efficient Transformers, the required hidden dimension increases as the problem size grows in most scenarios, while this is not the case for the standard Transformer. Moreover, the dependency between hidden dimension and problem scale is more pronounced for LIS than for ED. These results validate our theory and offer practical insights into the strengths and weaknesses of efficient Transformers.

Notations. We adopt the big-O notation throughout this paper. Specifically, given two functions f, g : X → [0, ∞), where X can be any set, we write f = O(g) if there exists a constant c > 0 such that f(x) ≤ c·g(x) for all x ∈ X. We also write f = Ω(g) if g = O(f), and f = Θ(g) if both f = O(g) and f = Ω(g) hold. Moreover, given two functions f, g : ℕ₊^d → [0, ∞), we write f = Õ(g) if there exist constants c, k > 0 such that f(x) ≤ c·g(x)·log^k(x₁ ⋯ x_d) for all x ∈ ℕ₊^d. The notations Ω̃(·) and Θ̃(·) are defined similarly.

2. Related Work

Transformers and Large Language Models have received significant attention due to their unprecedented success across various domains. A considerable body of literature has emerged to establish a deeper theoretical understanding of their strengths and constraints.

Universal Approximation. Initially, the theoretical focus was on the capacity of Transformers to approximate diverse functions. Yun et al. (2019) showed that adequately sized Transformers can universally approximate any continuous sequence-to-sequence function within certain bounds. A parallel line of work first showed that Transformers with infinite precision are Turing-complete (Pérez et al., 2019; 2021), and later Wei et al. (2022a) established that Transformers with finite precision are approximately Turing-complete. Recently, Alberti et al. (2023) proved that Linear Transformers are also universal approximators. Whereas these results approach expressiveness by proving computational capacity, we complement our expressiveness results with complexity lower bounds for practical settings.

Formal Language Learning. Additionally, the Transformer's expressivity has been studied in the context of formal language learning. Bhattamishra et al. (2020) constructed a Transformer that detects counter languages, and Yao et al.
(2021) show how to detect Dyck languages. Liu et al. (2022) show that shallow Transformers can learn finite state automata and simulate them for a number of steps that scales with the model size. Conversely, Hahn (2020) shows that Transformers cannot model certain distributions over languages. Other works use classical techniques from circuit complexity (Furst et al., 1984) to relate Transformers to classes of circuits (Hao et al., 2022; Merrill et al., 2022; Merrill & Sabharwal, 2023).

Measuring Complexity. Weiss et al. (2021) introduce a programming language that maps to learnable Transformer encoders and facilitates the analysis of the complexity of problems with respect to layers and attention heads. Sanford et al. (2023) introduce a sparse averaging task that requires recurrent and feed-forward networks to be of linear complexity, whereas the Transformer only needs to scale logarithmically. These works are similar to ours in that they establish concrete relationships between model complexity and the solvability of the posed problems; our work, however, deals with autoregressive efficient Transformers equipped with Chain-of-Thought.

In-context learning. A recent line of work studies the in-context learning ability of LLMs (Garg et al., 2022; Brown et al., 2020). Following this, theoretical results (Dai et al., 2023; Von Oswald et al., 2023; Akyürek et al., 2022) show that in-context learning can implement gradient descent. Another line of work explains in-context learning via induction heads (Elhage et al., 2021; Olsson et al., 2022). Similarly, Feng et al. (2023) show that autoregressive Transformers can learn to perform dynamic programming when equipped with Chain-of-Thought. While working in the same setting as Feng et al. (2023), we investigate efficient Transformers and present a problem class that encourages efficiency.

Efficient Transformers. Due to the high complexity of the attention layer, many more efficient methods have been proposed. A first series of ideas exploits fixed attention patterns (Child et al., 2019; Beltagy et al., 2020; Qiu et al., 2020). Another line of work approximates the attention matrix as low-rank or with kernels (Katharopoulos et al., 2020; Wang et al., 2020; Choromanski et al., 2021), and further works employ learned patterns (Kitaev et al., 2020; Tay et al., 2020a; Roy et al., 2021). A last set of works moves away from Transformers entirely (Sun et al., 2023; Gu & Dao, 2023). Two recent works study when standard attention can be computed efficiently (Alman & Song, 2023) and the computational complexity of self-attention (Keles et al., 2023). In contrast to their work, we give theoretical analyses for existing and popular efficient Transformers.

3. Efficient Transformers

The autoregressive Transformer, also called the decoder-only Transformer (Radford et al., 2019; Dai et al., 2019), is a sequence-to-sequence neural network defined as follows. Given an input sequence s of length n, it first transforms each input token s_i (i ∈ [n]) into a D-dimensional vector x_i^(0) = Embed(s_i) + p_i ∈ R^D, where Embed(·) is the token embedding layer and p_i is a learnable positional embedding.
Then, M Transformer blocks follow, the l-th of which has the following form:

h_i^(l) = x_i^(l-1) + Attn^(l)(x_i^(l-1); {x_j^(l-1) : j ∈ [i]}),   (1)
x_i^(l) = h_i^(l) + FFN^(l)(h_i^(l)).   (2)

Here, Attn^(l) and FFN^(l) denote the multi-head self-attention layer and the feed-forward network of the l-th Transformer block, respectively:

Attn^(l)(x, S) = Σ_{h=1}^{H} W_O^(l,h) H^(l,h)(x, S),   (3)
H^(l,h)(x, S) = ( Σ_{z∈S} exp((W_K^(l,h) z)^⊤ W_Q^(l,h) x) · W_V^(l,h) z ) / ( Σ_{z∈S} exp((W_K^(l,h) z)^⊤ W_Q^(l,h) x) ),   (4)
FFN^(l)(x) = W_2^(l) σ(W_1^(l) x),   (5)

where W_Q^(l,h), W_K^(l,h), W_V^(l,h) ∈ R^{(D/H)×D} and W_O^(l,h) ∈ R^{D×(D/H)} are the query, key, value, and output matrices of the h-th head in the l-th layer, respectively, and W_1^(l), W_2^(l) ∈ R^{D×D} are the weight matrices in the FFN. The activation σ is chosen as GeLU (Hendrycks & Gimpel, 2016), following (Radford et al., 2019; Devlin et al., 2019). The computed embedding x_n^(M) will be used to predict the next token s_{n+1}, which is then concatenated to the input to continue the sequence generation process. The process stops when an End-of-Sentence token is generated.

Based on Equations (1) to (5), it is easy to see that the computational complexity of an autoregressive Transformer is Θ(M(L²D + LD²)), where L is the sequence length. This quadratic dependency on L limits the application of Transformers to long text, in particular for complex reasoning tasks. To battle this, researchers have proposed various efficient Transformers to reduce the complexity. In our work, we investigate the Sparse Transformer and the Linear Transformer. Below, we describe the two architectures studied in this paper.

Sparse Transformer. Unlike the standard Transformer, where each token x_i^(l) can attend to all previous positions {x_j^(l) : j ∈ [i]} (see Equation (1)), in a Sparse Transformer it only attends to a subset of previous tokens {x_j^(l) : j ∈ I_i}. In this paper, we study a standard design paradigm proposed in Child et al. (2019), which employs a block-wise pattern as shown in the following:

I_i = {j : i − kB < j ≤ i} ∪ {j : (j − 1) mod B ≥ B − c},   (6)

where B is called the block size and k, c are constant integers. When B = Θ(√L), the Sparse Transformer achieves a minimal complexity of Θ(M(L√L·D + LD²)). We note that GPT-3 adopted the above design paradigm (Brown et al., 2020).

Linear Transformer. Another line of work proposed to accelerate the attention computation (Equation (3)) using kernel-based approximations. A representative approach is the Linear Transformer (Katharopoulos et al., 2020), which approximates Attn^(l) with the following formula:

Attn_linear^(l)(x, S) = Σ_{h=1}^{H} W_O^(l,h) H_linear^(l,h)(x, S),   (7)
H_linear^(l,h)(x, S) = ( Σ_{z∈S} ϕ(W_K^(l,h) z)^⊤ ϕ(W_Q^(l,h) x) · W_V^(l,h) z ) / ( Σ_{z∈S} ϕ(W_K^(l,h) z)^⊤ ϕ(W_Q^(l,h) x) ),   (8)

where they choose ϕ(x) = elu(x) + 1. The above computation can be accelerated by rearranging the order of computation so that the intermediate results Σ_{z∈S} (W_V^(l,h) z) ϕ(W_K^(l,h) z)^⊤ and Σ_{z∈S} ϕ(W_K^(l,h) z) associated with different S can be jointly computed using prefix sums, finally yielding a complexity of Θ(MLD²), which is linear in L.
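To make the two attention mechanisms above concrete, here is a minimal NumPy sketch (our own illustration of the equations above, not code from the paper): sparse_indices enumerates the pattern I_i of Equation (6) for illustrative constants k and c, and causal_linear_attention evaluates one linear-attention head of Equation (8) with the prefix-sum trick, so each position costs O(d²) and the whole sequence O(Ld²).

```python
import numpy as np

def sparse_indices(i, B, k=2, c=1):
    """Attention pattern I_i of Eq. (6) for a 1-indexed position i: the previous
    k blocks of size B, plus the last c positions of every block, which act as
    global "column" tokens (k = 2 and c = 1 are illustrative constants)."""
    local = set(range(max(1, i - k * B + 1), i + 1))
    columns = {j for j in range(1, i + 1) if (j - 1) % B >= B - c}
    return sorted(local | columns)

def phi(x):
    """Feature map of the Linear Transformer: elu(x) + 1, which is positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(X, WQ, WK, WV):
    """One linear-attention head (Eq. (8)) over a causal sequence X of shape (L, d),
    using the running prefix sums S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j)."""
    L, d = X.shape
    Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T
    S = np.zeros((d, d))
    z = np.zeros(d)
    out = np.zeros_like(V)
    for i in range(L):
        S += np.outer(phi(K[i]), V[i])                 # accumulate phi(k_i) v_i^T
        z += phi(K[i])                                 # accumulate phi(k_i)
        out[i] = (phi(Q[i]) @ S) / (phi(Q[i]) @ z)     # numerator / normaliser of Eq. (8)
    return out
```

Summing the per-head cost over M layers and H heads recovers the Θ(MLD²) total stated above; the softmax attention of Equation (4), in contrast, must touch all O(L²) query-key pairs.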
4. Expressiveness of Efficient Transformers in Reasoning Tasks

Reasoning constitutes a fundamental aspect of human intelligence and plays a vital role in problem-solving, decision-making, and planning. Recently, Transformer-based LLMs have demonstrated remarkable reasoning abilities (OpenAI, 2023; Touvron et al., 2023). This has sparked a series of studies aimed at theoretically understanding how powerful these models are. In particular, Feng et al. (2023) recently revealed that autoregressive Transformers are capable of solving a general class of reasoning problems formalized as Dynamic Programming (DP). In this section, we extend this finding by investigating how things change when moving to various types of efficient Transformers.

4.1. Problem formulation

Dynamic programming decomposes a complex reasoning problem into a sequence of reasoning steps, each of which corresponds to a subproblem and is called a DP state. Different subproblems depend on each other because they can be efficiently solved based on the answers of previously solved subproblems. Formally, denoting by dp(i) the answer of subproblem i, the relation between subproblems can be characterized using a transition function:

dp(i) = f(i, dp(h_1(i)), …, dp(h_K(i)), s_{g_1(i)}, …, s_{g_J(i)}),   (9)

where s is the input sequence, and f, g_1, …, g_J, h_1, …, h_K are functions that depend on the problem. In other words, the answer of each subproblem is fully determined by the answers of a finite number of previous subproblems plus a finite number of input tokens. Based on Equation (9), we can sequentially solve all subproblems one by one. After solving all subproblems, the final answer can be computed as u(dp(i_N)), where i_N is the last DP state and u is a problem-dependent function. By defining our problem this generally, we also cover CoT problems.

We assume that the functions f, g, h, and u above can be approximated by an MLP with GeLU activation of constant size. We also assume that during the CoT generation process, the next state can be obtained by an MLP whose input is the current state. One can refer to Appendix B for a formal description. We argue that these assumptions are mild and that they have been used in previous work (Feng et al., 2023). In our subsequent analysis, without loss of generality, we assume that each input element s_j is an integer, and that each state i, each DP value dp(i), and the final answer can all be represented by vectors of integer elements. The domain of these integers can grow polynomially with respect to the length L.

Output format. Following Feng et al. (2023), given a DP task and an input sequence s, an autoregressive Transformer generates the answer with all intermediate steps in the following form:

(s_1, 0, 0, 0) … (s_n, 0, 0, 0) | (0, i_1, dp(i_1), 0) … (0, i_N, dp(i_N), 0) (0, 0, 0, u(dp(i_N))).   (10)

Here, the subsequence ending at the special token | is the input to the Transformer, and the remainder is generated autoregressively. The output at each position is split into four parts that store the input, state, DP value, and final answer, respectively. We denote by i_1, …, i_N the sequence of DP states representing all subproblems in order. We consider the regression setting where the output at each position is obtained from the embedding of the last Transformer layer by projecting each dimension to the nearest integer. Similarly, each generated output directly serves as the input of the next position (without using a token embedding layer).
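As a concrete instance of this output format, the sketch below (our own illustrative encoding; the exact state and value encoding used in the paper may differ) writes the Longest Increasing Subsequence task in the form of Equation (10), with each DP value being the integer pair (length of the LIS ending at position i, best length so far), so the final answer is read off the last DP value. Note that each step may read DP values of arbitrarily old states, which is precisely the lack of locality that Section 5 turns to.

```python
def lis_cot(seq):
    """Emit the CoT sequence of Eq. (10) for the LIS task (illustrative encoding)."""
    cot = [(s, 0, 0, 0) for s in seq] + ["|"]            # input segment, then the separator
    dp, best = [], 0
    for i, s in enumerate(seq):                           # one reasoning step per DP state
        ending_here = 1 + max((dp[j][0] for j in range(i) if seq[j] < s), default=0)
        best = max(best, ending_here)
        dp.append((ending_here, best))                    # DP value: a short integer vector
        cot.append((0, i, dp[i], 0))                      # (input, state, DP value, answer)
    cot.append((0, 0, 0, best))                           # final answer u(dp(i_N))
    return cot

# For example, lis_cot([4, 1, 5, 2, 3]) ends with the answer token (0, 0, 0, 3).
```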
Log-precision Transformers. We adopt a realistic and widely-used setting where all internal neurons in the Transformer can only store floating-point numbers within a finite, O(log L)-bit precision (Merrill & Sabharwal, 2023; Liu et al., 2023; Feng et al., 2023), and all basic floating-point computations are truncated, as implemented on a computer. Log-precision implies that each neuron has a limited capacity for computation and information storage. Nevertheless, such neurons remain powerful as they can represent a large range of values (i.e., polynomial in the sequence length L), recovering important quantities like positional embeddings. Under the above assumptions, Feng et al. (2023) proved the following main result for the standard Transformer:

Theorem 4.1 (informal). Consider any DP problem defined above that satisfies the assumptions in Appendix B. For any integer n > 0, there exists a log-precision autoregressive Transformer with a constant depth M, a constant hidden dimension D, and a constant number of attention heads H (independent of n) that can generate the correct output for all inputs s of length n.

4.2. Main results

Now, we investigate whether the efficient Transformers defined in Section 3 are as powerful as the standard Transformer in solving DP problems. In particular, we establish a result similar to Theorem 4.1, which we present in the theorem below:

Theorem 4.2. Consider any DP problem satisfying the same conditions as in Theorem 4.1. Given any integer n > 0, let L be the length of the output sequence when the input sequence length is n. Then, for both the (log-precision) Sparse Transformer with block size B = Θ(√L) and the Linear Transformer, there is a model with a constant depth M, a constant number of attention heads H, and a hidden dimension D = O(√L) that can generate the correct output for all inputs s of length n.

The proof of Theorem 4.2 is non-trivial and is deferred to Appendix B.1. In the proof, we give explicit constructions of parameters for sparse/linear attention and FFN layers, showing that these layers can implement a set of basic operations presented in Appendix A. We then use these operations as building blocks to form a complete model that solves the DP task. Theorem 4.2 suggests that replacing the standard self-attention with these efficient variants does not restrict the model's expressiveness in reasoning. However, when we compare the total complexity O(L²) of the derived models, we can see that it is the same as for standard attention, which only needs D = O(1).

Is the increase in model size necessary? Theorem 4.2 only gives a complexity upper bound for Sparse/Linear Transformers. It remains to show whether the bound is tight and what the lower bound on the required hidden dimension is. To answer this question, we will focus on a restricted class of DP problems which we call regular DP problems:

Definition 4.3. A DP problem is called regular if for any two different input sequences s^(1) and s^(2) (of the same length) and a fixed but arbitrary model that solves the DP problem, there is a state i such that dp(i) differs between inputs s^(1) and s^(2).

We remark that regularity is a weak assumption, which only states that the reasoning process (not the final answer) should not be exactly the same when the input changes. For example, it excludes the case where the whole DP process does not depend on a specific input element s_j. Equipped with the regularity assumption, we present a central impossibility result:

Theorem 4.4. Consider any regular DP problem satisfying the same conditions as in Theorem 4.1. Assume that the output sequence length L is proportional to the input sequence length n, i.e., L = Θ(n). Then, given a sufficiently large n, for both the (log-precision) Sparse Transformer with block size B = Θ(√L) and the Linear Transformer, a model with a constant depth M and a constant number of attention heads H can generate the correct output for all inputs s of length n only if the hidden dimension D = Ω(√L).

As presented in Appendix B.2, the proof of Theorem 4.4 is based on the following finding: there are inherent information bottlenecks in both types of efficient Transformers. Here, the bottleneck is a set of neurons whose values completely determine all ensuing outputs from a specific position. Due to the log-precision assumption, these neurons can only store a limited amount of information. Hence, it is only possible to recover all subsequent outputs when the hidden dimension is Ω(√L); otherwise, the pigeonhole principle implies that there are two different input sequences that share the same set of neuron values, yielding a contradiction with Definition 4.3.
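For intuition, here is a rough, informal version of this counting argument for the Linear Transformer (our simplification of the proof in Appendix B.2; the constants, the exact choice of bottleneck position, and the premise of at least two possible values per input token are glossed over). Everything generated after the separator token is a function of the running sums maintained by each linear-attention head at that position (cf. Lemma A.8 in Appendix A):

```latex
\[
\underbrace{M \cdot H \cdot \Bigl(\bigl(\tfrac{D}{H}\bigr)^{2} + \tfrac{D}{H}\Bigr)}_{\text{entries of the running sums } (s,\,z) \text{ over all layers and heads}}
= O(D^{2}) \ \text{scalars}, \quad \text{each of } O(\log L) \ \text{bits}
\;\Longrightarrow\; O(D^{2}\log L) \ \text{bits of state.}
\]
\[
\text{A regular DP problem has } 2^{\Omega(n)} = 2^{\Omega(L)} \text{ inputs with distinct reasoning traces}
\;\Longrightarrow\; D^{2}\log L = \Omega(L)
\;\Longrightarrow\; D = \Omega\!\Bigl(\sqrt{L/\log L}\Bigr).
\]
```

A similar count for the Sparse Transformer, where information about tokens lying more than kB positions back must flow through the O(√L) global column positions, yields the same √L-type requirement up to logarithmic factors.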
5. When Can Efficient Transformers Really Save Computation?

In the previous section, we showed the surprising result that these efficient Transformers may not lead to reduced complexity compared to the standard Transformer on general reasoning tasks. However, one should not hastily jump to the conclusion that these efficient Transformers are always inefficient. In this section, we discuss in which situations efficient Transformers are indeed efficient.

5.1. A motivating example: evaluating arithmetic expressions

We begin by investigating a less complex task proposed by Feng et al. (2023), called arithmetic evaluation. The task is to evaluate an arithmetic expression like 2 × (1 + 5) ÷ 4, and the complete output sequence looks like 2 × (1 + 5) ÷ 4 = 2 × 6 ÷ 4 = 12 ÷ 4 = 3. Feng et al. (2023) proved that a standard Transformer of constant size can solve this task. Surprisingly, we find that a similar result also holds for a constant-size Sparse Transformer, as shown in the proposition below:

Proposition 5.1 (informal). For any integer n, there exists a log-precision Sparse Transformer with block size B = Θ(√L), 5 layers, 5 attention heads per layer, and a constant hidden dimension that can generate the correct output for the arithmetic evaluation task for all expressions of length no more than n.

Owing to the constant dimension, the complexity of the arithmetic evaluation task can be reduced to O(L√L) by using Sparse Transformers. We next turn to Linear Transformers. While we do not give explicit constructions of model parameters that solve the arithmetic task, one can still derive a lower bound by using an analysis similar to that of Theorem 4.4:

Proposition 5.2 (informal). For any integer n, a log-precision Linear Transformer with a constant depth M and a constant number of heads H can generate the correct output for the arithmetic evaluation task for all expressions of length no more than n only if the hidden dimension D = Ω(L^{1/4}).

Based on this result, the complexity lower bound of Linear Transformers scales like Ω(L√L), which, interestingly, matches that of Sparse Transformers and is also strictly less than Ω(L²). This finding naturally raises the following question: Why do these efficient Transformers no longer require a large hidden dimension on the arithmetic task? We answer this question in the next subsection.

5.2. Locality encourages efficiency

A key difference of the arithmetic task compared to general DP is that its reasoning process exhibits inherent structure. To be specific, the output sequence of an arithmetic computation can be partitioned into blocks (separated by the symbol "="), where the content of each block depends solely on the preceding one and is irrelevant to other historical blocks. This paradigm is often called data locality in the computer science literature and is also common in general reasoning processes that follow the so-called Chain-of-Thought format (Wei et al., 2022b).
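The sketch below (a toy rewriter we add for illustration; it handles only simple integer expressions like the running example, written with ASCII * and / in place of × and ÷, and evaluates one operation per step) makes this block structure explicit: every block after an "=" is produced from the previous block alone.

```python
import re

def one_step(expr):
    """Rewrite an integer expression by evaluating exactly one operation,
    mimicking one '=' block of the arithmetic CoT (assumes well-formed input)."""
    # 1) an innermost parenthesised operation, e.g. "(1+5)"
    m = re.search(r"\((\d+)([+\-*/])(\d+)\)", expr)
    # 2) otherwise the leftmost * or /, then the leftmost + or -
    if m is None:
        m = re.search(r"(\d+)([*/])(\d+)", expr) or re.search(r"(\d+)([+\-])(\d+)", expr)
    a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
    val = {"+": a + b, "-": a - b, "*": a * b, "/": a // b}[op]
    return expr[:m.start()] + str(val) + expr[m.end():]

def arithmetic_cot(expr):
    """Produce the full chain, e.g. '2*(1+5)/4 = 2*6/4 = 12/4 = 3'."""
    chain = [expr]
    while not expr.isdigit():
        expr = one_step(expr)      # the next block depends only on the previous block
        chain.append(expr)
    return " = ".join(chain)
```

Note how arithmetic_cot never looks further back than the block it has just produced; the m-locality condition defined next formalizes exactly this property.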
In light of this, we consider a special class of DP problems dubbed m-locality DP, which is formally defined below:

Definition 5.3 (m-locality DP). Consider a DP problem with output sequence o_1, …, o_L of the form (10), where o_1, …, o_n is the input sequence. The DP problem is said to satisfy the m-locality condition for some m ≥ n if there exist functions f, h_1, …, h_K such that for all i ∈ [L], o_i = f(o_{h_1(i)}, …, o_{h_K(i)}), where i − m ≤ h_k(i) < i for all k ∈ [K].

In other words, the m-locality condition simply says that each DP state only depends on the most recent m DP states. Note that the assumption m ≥ n is necessary to ensure that all inputs contribute to the answer of the DP problem. Below, we discuss how the required hidden dimension of efficient Transformers can be reduced when m is far smaller than L. We first consider the Sparse Transformer, for which we have the following result:

Proposition 5.4. Consider any m-locality DP problem satisfying the same conditions as in Theorem 4.1. Given any integer n > 0, let L be the length of the output sequence when the input sequence length is n. Then, there exists a (log-precision) Sparse Transformer with block size B = Θ(m), a constant depth M, a constant number of attention heads H, and a constant hidden dimension D that can generate the correct output for all inputs s of length n.

As a result, the complexity of the Sparse Transformer scales like O(mL), which is strictly less than Θ(L²) when m is far smaller than L. We next turn to the Linear Transformer, for which we have the following lower bound:

Proposition 5.5. Consider any m-locality regular DP problem satisfying the same conditions as in Theorem 4.1, and assume that m = Θ(n), where n is the input sequence length. Then, a log-precision Linear Transformer with a constant depth M and a constant number of heads H can generate the correct output for all inputs s of length n only if the hidden dimension D = Ω(√m).

The above result implies that the complexity lower bound of the Linear Transformer, which is imposed by the bottleneck, scales like Ω(mL), which is strictly less than Θ(L²) when m is far smaller than L. However, we remark that it remains a challenging open question whether this complexity lower bound can be matched.
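To connect this definition to the tasks used in the experiments below, consider the edit-distance DP in its usual row-by-row order (our sketch; the serialization used in the paper's experiments may differ). Each emitted state (i, j) reads only dp(i−1, j), dp(i, j−1), and dp(i−1, j−1), which lie O(|b|) positions back in the emitted order, so the task is m-local with m = O(√L), since the CoT length L is Θ(|a|·|b|). The LIS recurrence, in contrast, may read arbitrarily old states.

```python
def edit_distance_states(a, b):
    """Row-major ED DP: returns (edit distance, maximum lookback over all states),
    showing that every state reads only states at most len(b)+2 positions back."""
    dp, order = {}, {}
    max_lookback, t = 0, 0
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 or j == 0:
                dp[i, j] = i + j                                   # base cases
            else:
                dp[i, j] = min(dp[i - 1, j] + 1,                   # delete
                               dp[i, j - 1] + 1,                   # insert
                               dp[i - 1, j - 1] + (a[i-1] != b[j-1]))  # replace / match
                lookback = t - min(order[i - 1, j - 1], order[i - 1, j], order[i, j - 1])
                max_lookback = max(max_lookback, lookback)
            order[i, j] = t                                        # position in the CoT
            t += 1
    return dp[len(a), len(b)], max_lookback

# e.g. edit_distance_states("kitten", "sitting") == (3, 9): distance 3,
# and every state looks back at most len(b) + 2 = 9 positions.
```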
6. Experiments

In the preceding sections, we conducted a theoretical analysis to assess the capabilities of efficient Transformers for general DP problems and problems with locality. This section serves to validate those findings through comprehensive empirical experimentation. Inspired by the reasoning evaluation in Feng et al. (2023), we adopt a similar experimental design using common DP problems with Chain-of-Thought demonstrations. We focus on understanding how two key factors, problem size and the embedding dimension of the Transformer model, affect performance across tasks.

Figure 1. A comparison of accuracies across different tasks and model types. Each column corresponds to a task (Arithmetic, ED, LIS), and each row to a model (Standard Transformer, Linear Transformer, Sparse Transformer). Within each subplot, the x-axis represents the embedding dimension, and the y-axis denotes the problem size. The color intensities indicate the accuracy level achieved by the respective models. The figure demonstrates that efficient Transformers need larger hidden dimensions and that this requirement increases with problem size. It also highlights how standard Transformers can handle tasks across all difficulty levels with fixed embedding dimensions.

6.1. Experimental Design

Tasks and datasets. We chose three well-known tasks, LIS, ED, and Arithmetic, to represent a variety of problems in the DP domain. For the LIS task, the goal is to find the length of the longest increasing subsequence of a given integer sequence. The ED task's goal is to calculate the minimum cost required to convert one sequence to another using three basic edit operations: insert, delete, and replace. For the Arithmetic task, the goal is to calculate the correct result of an arithmetic expression consisting of numbers, addition, subtraction, multiplication, division, and brackets. The LIS task is the most general DP problem without the locality property, while ED and Arithmetic exhibit higher locality. Following previous work (Feng et al., 2023), we curate four datasets for each task, with different problem sizes and increasing difficulty. For the LIS task, the datasets encompass sequences of lengths {40, 110, 175, 250}, equating to CoT lengths of {83, 223, 353, 503}. For the ED task, the datasets contain sequences of varying lengths, specifically of average lengths 6, 12, 16, and 20, equating to maximum CoT lengths of {72, 210, 342, 506}. For the Arithmetic task, the datasets consist of expressions with the number of operators in {6, 10, 13, 16}, equating to maximum CoT lengths of {87, 221, 355, 505}. Each training dataset has 1M samples, and each corresponding testing dataset has 0.1M.

Model configurations. For the standard Transformer model, we use the same configurations as Feng et al. (2023), with 3 layers and 4 attention heads, albeit with varying embedding dimensions. We employ sinusoidal positional embeddings and apply Xavier initialization to all parameters. For the activation function, we chose the standard GeLU, and the embedding dimensions for the ED and LIS tasks range over {32, 64, 128, 256, 512, 1024}. As for the Arithmetic task, we exclude the embedding dimension 1024, as all models already perform well at 512. The FFN layer's hidden dimension is four times the embedding dimension. We use the same configurations for both the Linear and the Sparse Transformer. Within the Sparse Transformer, the block size B is 2^⌈log₂ √L⌉, with L representing the upper limit of the CoT length. Every experiment has the global token count c fixed at 1 for every block. Furthermore, we conduct the experiments on Transformer models with 5 layers while keeping other parameters the same. The additional results are shown in Appendix D.

Table 2. Minimum GFLOPs of different model types to achieve an accuracy above 90% on the Arithmetic task.

Method     Length 87    Length 221    Length 355    Length 505
Standard   0.027        0.266         0.426         0.605
Linear     0.422        4.220         6.756         9.584
Sparse     0.106        1.055         1.690         2.399
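For concreteness, plugging the four maximum CoT lengths into this block-size rule gives the values below (our own arithmetic, shown only to illustrate the formula):

```python
import math

def sparse_block_size(L_max):
    """Block size used for the Sparse Transformer: B = 2**ceil(log2(sqrt(L_max)))."""
    return 2 ** math.ceil(math.log2(math.sqrt(L_max)))

print({L: sparse_block_size(L) for L in (87, 221, 355, 505)})
# {87: 16, 221: 16, 355: 32, 505: 32}
```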
Model training and inference. In all experiments, we employ the AdamW optimizer (Loshchilov & Hutter, 2017) with the following hyperparameters: β₁ = 0.9, β₂ = 0.999, lr = 10⁻⁴, and weight decay 0.01. To enhance model generalization, we maintain a consistent dropout rate of 0.1. Our optimization minimizes the negative log-likelihood loss over all tokens in both the CoT steps and the answers. Each model is trained for 100 epochs with a batch size of 512. During the inference stage, the model generates the entire CoT process token by token, using greedy search, until reaching the End-of-Sentence token. We evaluate the models' performance using the accuracy of the final answer, which is the last output in the sequence. We run all experiments on four V100 GPUs.

6.2. Experimental Results

Figure 1 shows our main results. Each column of the figure corresponds to a particular task, and each row to a different model. Within each subplot, the x-axis shows the embedding dimension, and the y-axis indicates the problem size. The color intensities indicate the corresponding accuracies. For almost all tasks and problem sizes, models with sufficiently large embedding dimensions achieve nearly 100% accuracy, except for the Sparse Transformer on the LIS task. Nevertheless, even for this task, the accuracy still increases as the embedding dimension increases. This finding shows that efficient Transformers have adequate expressiveness to handle these DP tasks.

When we compare the subplots column-wise, it is evident that efficient Transformers generally need larger hidden dimensions than standard Transformers. Moreover, within each subplot of the efficient Transformers, the required embedding dimension increases as the problem size grows. In contrast, standard Transformers with fixed embedding dimensions of 128 or 256 can handle tasks across all difficulty levels. These observations confirm our theoretical findings, suggesting that efficient Transformers are less efficient than previously perceived. In Table 2, we further compare the minimal number of FLOPs required by each model to achieve 90% accuracy on the Arithmetic task. The standard Transformer requires the lowest number of FLOPs across all lengths.

Figure 2. Accuracies of the Sparse Transformer and the Linear Transformer on the ED Local task, with the problem size on the y-axis and the embedding dimension on the x-axis. We can observe that both models benefit from locality.

Comparing the subplots row-wise, the growth in required embedding dimension with the problem size becomes more pronounced as the locality decreases. This suggests that efficient Transformers are more efficient for DP tasks with strong locality, which aligns with our previous results.

Locality Study. Although the previous results already indicate that problems with lower locality require larger embedding dimensions, we conducted another, more explicit experiment in which we modified the ED problem to have higher locality. We call this problem ED Local and show the results in Figure 2. We can clearly see that the increased locality leads to a reduction in the necessary embedding size, which backs up our theoretical findings.

7. Limitations & Conclusion

Limitations. Although we show our results for representative efficient Transformers, this does not mean that our findings directly transfer to all models with similar designs. Further, although our experiments indicate that Linear Transformers also benefit from locality, it remains open whether the bound for the Linear Transformer can be tightened.

Conclusion. While the Sparse Transformer and the Linear Transformer are expressive enough to solve general DP tasks, our findings indicate that they cannot sustain their efficiency in the general case.
This contradicts the anticipated efficiency gains, bringing their performance closer to that of standard Transformers, which maintain a constant model size. The paradoxical nature of these efficient Transformers prompts a crucial question: under what conditions do these architectures become efficient? By delving into arithmetic expression evaluation and introducing the locality assumption, we identify scenarios where efficient Transformers can be efficient. Our theoretical results find empirical support through extensive experiments on tasks like Arithmetic, Longest Increasing Subsequence (LIS), and Edit Distance (ED). The observed dependency between hidden dimension and problem scale for efficient Transformers, in contrast to the stability of standard Transformers, validates our theory.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgement

Liwei Wang is supported by National Science and Technology Major Project (2022ZD0114902) and National Science Foundation of China (NSFC62276005). Di He is supported by National Science Foundation of China (NSFC62376007).

References

Akyürek, E., Schuurmans, D., Andreas, J., Ma, T., and Zhou, D. What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.

Alberti, S., Dern, N., Thesing, L., and Kutyniok, G. Sumformer: Universal approximation for efficient transformers. In Topological, Algebraic and Geometric Learning Workshops 2023, pp. 72–86. PMLR, 2023.

Alman, J. and Song, Z. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.

Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv:2004.05150, 2020.

Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. arXiv preprint arXiv:2009.11264, 2020.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901, 2020.

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021.

Dai, D., Sun, Y., Dong, L., Hao, Y., Ma, S., Sui, Z., and Wei, F. Why can GPT learn in-context? Language models implicitly perform gradient descent as meta-optimizers. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, 2019.
Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021. Feng, G., Gu, Y., Zhang, B., Ye, H., He, D., and Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 2023. Furst, M., Saxe, J. B., and Sipser, M. Parity, circuits, and the polynomial-time hierarchy. Mathematical systems theory, 17(1):13 27, 1984. Garg, S., Tsipras, D., Liang, P., and Valiant, G. What can transformers learn in-context? a case study of simple function classes. In Advances in Neural Information Processing Systems, 2022. Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. ar Xiv preprint ar Xiv:2312.00752, 2023. Hahn, M. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156 171, 2020. Hao, Y., Angluin, D., and Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Transactions of the Association for Computational Linguistics, 10:800 810, 2022. Do Efficient Transformers Really Save Computation? Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). ar Xiv preprint ar Xiv:1606.08415, 2016. Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156 5165. PMLR, 2020. Keles, F. D., Wijewardena, P. M., and Hegde, C. On the computational complexity of self-attention. In International Conference on Algorithmic Learning Theory, pp. 597 619. PMLR, 2023. Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. ar Xiv preprint ar Xiv:2001.04451, 2020. Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, 2022. Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. ar Xiv preprint ar Xiv:2210.10749, 2022. Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. In The Eleventh International Conference on Learning Representations, 2023. Loshchilov, I. and Hutter, F. Fixing weight decay regularization in adam. 2017. Luo, S., Li, S., Cai, T., He, D., Peng, D., Zheng, S., Ke, G., Wang, L., and Liu, T.-Y. Stable, fast and accurate: Kernelized attention with relative positional encoding. In Advances in Neural Information Processing Systems, volume 34, pp. 22795 22807, 2021. Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 2023. Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10:843 856, 2022. Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., and Odena, A. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop, 2022. 
Olsson, C., Elhage, N., Nanda, N., Joseph, N., Das Sarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., Mc Candlish, S., and Olah, C. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-contextlearning-and-induction-heads/index.html. Open AI. Gpt-4 technical report. ar Xiv preprint ar Xiv:2303.08774, 2023. Peng, H., Pappas, N., Yogatama, D., Schwartz, R., Smith, N. A., and Kong, L. Random feature attention. ar Xiv preprint ar Xiv:2103.02143, 2021. P erez, J., Marinkovi c, J., and Barcel o, P. On the turing completeness of modern neural network architectures. ar Xiv preprint ar Xiv:1901.03429, 2019. P erez, J., Barcel o, P., and Marinkovic, J. Attention is turing complete. The Journal of Machine Learning Research, 22(1):3463 3497, 2021. Qiu, J., Ma, H., Levy, O., Yih, W.-t., Wang, S., and Tang, J. Blockwise self-attention for long document understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2555 2565, 2020. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. Roy, A., Saffar, M., Vaswani, A., and Grangier, D. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53 68, 2021. Sanford, C., Hsu, D., and Telgarsky, M. Representational strengths and limitations of transformers. ar Xiv preprint ar Xiv:2306.02896, 2023. Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models (2023). URL http://arxiv. org/abs/2307.08621 v1. Tay, Y., Bahri, D., Yang, L., Metzler, D., and Juan, D.-C. Sparse sinkhorn attention. In International Conference on Machine Learning, pp. 9438 9447. PMLR, 2020a. Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham, P., Rao, J., Yang, L., Ruder, S., and Metzler, D. Long range arena: A benchmark for efficient transformers. ar Xiv preprint ar Xiv:2011.04006, 2020b. Tay, Y., Dehghani, M., Bahri, D., and Metzler, D. Efficient transformers: A survey, 2022. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Do Efficient Transformers Really Save Computation? Bhosale, S., et al. Llama 2: Open foundation and finetuned chat models. ar Xiv preprint ar Xiv:2307.09288, 2023. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017. Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., and Vladymyrov, M. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pp. 35151 35174. PMLR, 2023. Vyas, A., Katharopoulos, A., and Fleuret, F. Fast transformers with clustered attention. In Advances in Neural Information Processing Systems, volume 33, pp. 21665 21674, 2020. Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. ar Xiv preprint ar Xiv:2006.04768, 2020. Wei, C., Chen, Y., and Ma, T. Statistically meaningful approximation: a case study on approximating turing machines with transformers. 
Advances in Neural Information Processing Systems, 35:12071 12083, 2022a. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 24824 24837, 2022b. Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. In International Conference on Machine Learning, pp. 11080 11090. PMLR, 2021. Yao, S., Peng, B., Papadimitriou, C., and Narasimhan, K. Self-attention networks can process bounded hierarchical languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3770 3785, 2021. Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S. J., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? ar Xiv preprint ar Xiv:1912.10077, 2019. Zhou, D., Sch arli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q. V., and Chi, E. H. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023. Do Efficient Transformers Really Save Computation? A. Technical Lemmas In this section, we propose some lemmas about the expressive power of the MLP, the linear transformer, and the sparse transformer. Due to the similarity between the efficient and the standard Transformer, we base some ideas and constructions on previous work (Feng et al., 2023). A.1. Lemmas for MLP In the previous work (Feng et al., 2023), the authors investigated the expressive power of the MLP with Ge LU activation function. They showed that a two-layer MLP with Ge LU activation and a fixed number of log-precision neurons can perform various tasks such as multiplication, linear transformation, and selection. Building on their work, our proof is more concise and clear. We will restate the relevant lemmas below, and refer to the appendices of Feng et al. (2023) for the proofs. Lemma A.1 (From Feng et al. (2023)). Let f : R2 R be a two-layer MLP with Ge LU activation, and the hidden dimension is 4. Then, for any ϵ > 0 and M > 0, there exist MLP parameters with ℓ norm upper bounded by O(poly(M, 1/ϵ)) such that |f(a, b) ab| ϵ holds for all a, b [ M, M]. Lemma A.2 (From Feng et al. (2023)). Let g : Rd1 Rd2 be a two-layer MLP with Re LU activation, and all parameter values are upper bounded by M. Then, for any ϵ > 0, there exists a two-layer MLP f of the same size with Ge LU activation and parameters upper bounded by O(poly(M, 1/ϵ)) in the ℓ norm, such that for all x Rd1, we have f(x) g(x) ϵ. Lemma A.3 (From Feng et al. (2023)). Let f : Rd1 Rd2 be a two-layer MLP with Ge LU activation, and the hidden dimension is 2d2. Let W Rd2 d1 be any matrix and denote M = maxij |Wij|. Then, for any ϵ > 0, there exist MLP parameters with ℓ norm bounded by O(poly(M, 1/ϵ)), such that for any x Rd1, we have f(x) W x ϵ. Lemma A.4 (From Feng et al. (2023)). Define the selection function g : Rd Rd R Rd as follows: g(x, y, t) = x if t 0, y if t < 0. (11) Let f : Rd Rd R Rd be a two-layer MLP with Ge LU activation, and the hidden dimension is 2d + 2. Then, for any ϵ > 0, α > 0, and M > 0, there exist MLP parameters with ℓ norm bounded by O(poly(M, 1/α, 1/ϵ)), such that for all x [ M, M]d, y [ M, M]d, and t [ , α] [α, + ], we have f(x, y, t) g(x, y, t) ϵ. Lemma A.5. For integer i, define Si = i 1 3 . 
Let n Z and D = n i=0Si. Define the function g : D Rn as follows: f(x) = mi if x Si, where mi has its first i entries 1 and the rest entries 0. Let f : R Rn be a two-layer MLP with Ge LU activation, and the hidden dimension is 2n. Then, for any ϵ > 0, α > 0, there exists MLP parameters with l norm bounded by O(poly(n, 1/ϵ)), such that for all x D, we have f(x) g(x) ϵ. Proof. Notice that the k-th entry of mj is I[k j] = Re LU 2k + 2 j + 3 Re LU 2k + 2 j + 1 Thus each entry of mj can be implemented by an MLP with hidden dimension 2. By Lemma A.2, we can perform both tasks using an MLP with Ge LU activation, with hidden dimension 2n. Lemma A.6. For integer i, define Si = i 1 3 . Let n Z and D = n i=0Si. Define the function g : D Rn as follows: f(x) = ni if x Si, where ni has its first i 1 entries 0 and the rest entries 1. Let f : R Rn be a two-layer MLP with Ge LU activation, and the hidden dimension is 2n. Then, for any ϵ > 0, α > 0, there exists MLP parameters with l norm bounded by O(poly(n, 1/ϵ)), such that for all x D, we have f(x) g(x) ϵ. Proof. Similarly to the previous proof, the k-th entry of nj is I[k j] = Re LU 2k 2 j 3 Re LU 2k 2 j 1 which indicates that each entry of nj can be implemented by an MLP with hidden dimension 2. By Lemma A.2, we can perform both tasks using an MLP with Ge LU activation, with hidden dimension 2n. Do Efficient Transformers Really Save Computation? Lemma A.7. For integer i, define Si = i 1 4 . Given positive integer n and integer i [1, n2]. There exists a two-layer MLP f : R R3 with Ge LU activation and hidden dimension O(n), such that for any integer i [1, n2] and x Si, f(x) = ( i/n , i/n 1, n ( i/n n i)). Here, n ( i/n n i) = i mod n if i mod n = 0, and n ( i/n n i) = n if i mod n = 0. Proof. Since i {1, 2, , n2}, we can get i/n {1, 2, , n}. By the proposition that i is an integer, we can use an MLP with hidden dimension 2n and Re LU activation to calcualte i/n as following: j=1 I[i jn] = j=1 Re LU 2i + 2 jn + 3 Re LU 2i + 2 jn + 1 Thus we can calculate i/n , i/n 1, n ( i/n n i) task by an MLP with Re LU activation with hidden dimension O(n). By Lemma A.2, we can finish these tasks by an MLP with Ge LU activation with hidden dimension O(n). A.2. Lemmas for Linear Transformer Lemma A.8. The attention module in Linear Transformer j i ϕ(WKxj) ϕ(WQxi)WV xi P j i ϕ(WKxj) ϕ(WQxi) (15) can be implemented as follows: s0 = 0, z0 = 0 (16) si = si 1 + ϕ(WKxi)(WV xi) (17) zi = zi 1 + ϕ(WKxi) (18) Attn(xi) = ϕ(WQxi) si ϕ(WQxi) zi (19) Since this implementation is quite similar to RNNs, we called s, z in our implementation hidden states. Now, we introduce the AGG operation for Linear Transformer. Let m be an integer and let x1, x2, , xm2 be a sequence of vectors where xi = (ˆxi, e i/m , eqi, 1) [ M, M]d+m+1, ˆxi Rd, and e i/m , eqi are one-hot vectors in Rm, and M is a large constant. Let Si = {j Z : qim (m 1) j min(qim, i)}. Define the AGG operation as follows: The output is a sequence of vectors u1, , um2 with ui = meanj Si ˆxj. The output is undefined when Si = . Lemma A.9. For any 0 < ϵ M, there exists an linear attention layer with hidden dimension O(max(d, m)) and one causal attention head that can approximate the AGG operation defined above. Specifically, suppose the attention output are o1, , om2, then we have oi ui ϵ for all i [m2]. Moreover, the l norm of attention parameters are bounded by O(poly(log(M), log(m), log(1/ϵ)). Proof. Without loss of generosity, we can assume d = m (otherwise, we can do some padding with 0). 
We will provide a proof based on the implement of Lemma A.8. We construct the query, key and value vectors as follows: Query: qi = µeqi µ Key: ki = µe i/m µ Value: vi = ˆxi where µ > 0 is a constant defined later. This can be achieved by setting appropriate WQ, WK, WV . Notice that j Si if and only if j i and j/m = qi. By the construction of qi, ki, vi, we can get aij := q i kj = ( 1 + (m 1) exp( µ), j Si 2 exp( µ) + (m 2) exp( 2µ), j / Si Do Efficient Transformers Really Save Computation? By taking µ = ln 2Mm3 ϵ , which is bounded by O(poly(log(M), log(m), log(1/ϵ)), we have j / Si aij (m2 |Si|)[2 exp( µ) + (m 2) exp( 2µ)] (m2 |Si|)[2 exp( µ) + (m 2) exp( 2µ)] + |Si|[1 + (m 1) exp( µ)] 1 + |Si| m2 |Si| 1+(m 1) exp( µ) 2 exp( µ)+(m 2) exp( 2µ) |Si| 2 exp( µ) + (m 2) exp( 2µ) 1 + (m 1) exp( µ) m2 m exp( µ) = m3 exp( µ) = ϵ 2M Similarly, for j Si, we have aij 1 |Si| 1 |Si| 1 + (m 1) exp( µ) (m2 |Si|)[2 exp( µ) + (m 2) exp( 2µ)] + |Si|[1 + (m 1) exp( µ)] (m2 |Si|) 2 exp( µ)+(m 2) exp( 2µ) 1+(m 1) exp( µ) + |Si| = (m2 |Si|) 2 exp( µ)+(m 2) exp( 2µ) 1+(m 1) exp( µ) |Si| h (m2 |Si|) 2 exp( µ)+(m 2) exp( 2µ) 1+(m 1) exp( µ) + |Si| i 1 |Si|(m2 |Si|) [2 exp( µ) + (m 2) exp( 2µ)] 1 |Si|m2 m exp( µ) = 1 |Si|m3 exp( µ) = ϵ 2M|Si| We thus obtain j aij ˆxj 1 |Si| j / Si aij + X which concludes our proof. A.3. Lemmas for Sparse Transformer The previous work by Feng et al. (2023) studied the expressive power of the standard attention layer and introduced a basic operation, COPY, that can be implemented by the attention layer. In this section, we introduce an updated version of the COPY operation for the sparse transformer and demonstrate that the sparse transformer module is capable of implementing this operation. Firstly, we will enumerate the elements and their respective notations that will be employed to define the COPY operation. A sequence of vectors x1, x2, , xn Rd+2, where xi = [vi, ri, i mod B, i B , 1], xi Rd, ri R, B Z. Three matrices K, Q, V Rd (d+2), where V 1 The attention sets S i for each i, where S i is the indices of the tokens attended by the i-th token. Two real number ρ, δ > 0 Denoting that ki = Kxi, qi = Qxi, vi = V xi Rd , we can define the matching set Si for the i as Si = {j < i | qi kj < ρ} S i. Then the output of the COPY operation u1, u2, , un is defined as ui = vpos(i), where pos(i) = argmaxj Si rj. Do Efficient Transformers Really Save Computation? Intuitively, the COPY operation copies the embedding of the matching token with the highest rating. Note that the output of the COPY operation can be anything when Si is empty. Moreover, we make the following regularity assumption: Assumption A.10. The input sequence x1, x2, , xn and the matrices K, Q, V satisfy the following condition: For any i, j [n], either |qi kj| ρ or qi kj δ For any i, j [n], either i = j or |ri rj| δ Assumption A.10 ensures that only one token with the highest rating is matched, and there is a substantial difference between this token and others. The subsequent lemma demonstrates that the sparse transformer module can effectively implement the COPY operation. Lemma A.11 (COPY Operation of Standard Transformer, From Feng et al. (2023)). Given that Assumption A.10 holds with ρ δ2 8M , there exists a standard transformer module defined in Equation (6) with one attention layer, block size B, embedding size O(d), and one attention head that can approximate the COPY operation defined above. 
The following lemmas demonstrate that the sparse Transformer module can effectively implement the COPY operation.

Lemma A.11 (COPY Operation of the Standard Transformer, from Feng et al. (2023)). Given that Assumption A.10 holds with $\rho \le \frac{\delta^2}{8M}$, for any $\epsilon > 0$ there exists a standard Transformer module defined in Equation (6) with one attention layer, block size $B$, embedding size $O(d)$, and one attention head that can approximate the COPY operation defined above. Specifically, for any sequence of vectors $x_1, x_2, \dots, x_n$, denote the corresponding output of the attention layer as $o_1, o_2, \dots, o_n$. Then, we have $\|o_i - u_i\|_\infty \le \epsilon$ for all $i \in [n]$ with $S_i \ne \emptyset$. Furthermore, the $\ell_\infty$ norm of the attention parameters is bounded by $O(\mathrm{poly}(M, 1/\delta, \log n, \log(1/\epsilon)))$.

Now we prove a special form of the COPY operation for the sparse Transformer: given the index of a target token, copy the embedding of the token at that index.

Lemma A.12 (COPY Operation of the Sparse Transformer). Given that Assumption A.10 holds with $\rho \le \frac{\delta^2}{8M}$, for any $\epsilon > 0$ there exists a sparse Transformer module defined in Equation (6) with three attention layers, block size $B$, embedding size $O(d)$, and one attention head that can approximate the COPY operation defined above. Specifically, for any sequence of vectors $x_1, x_2, \dots, x_n$, denote the corresponding output of the attention layer as $o_1, o_2, \dots, o_n$. Then, we have $\|o_i - u_i\|_\infty \le \epsilon$ for all $i \in [n]$ with $S_i \ne \emptyset$. Furthermore, the $\ell_\infty$ norm of the attention parameters is bounded by $O(\mathrm{poly}(M, 1/\delta, \log n, \log(1/\epsilon)))$.

The proof of this version is an extension of the proof of a lemma in Feng et al. (2023). For the reader's convenience, we provide a concise proof.

Proof. Layer 1. In our formulation, the input embedding $x_i$ has the form $x_i = (v_i, p_{\mathrm{target}}, i \bmod B, \lfloor i/B \rfloor, 1)$, where $p_{\mathrm{target}}$ is the target index of the COPY operation. The primary objective of the first layer is to use its MLP to transform the input embedding. The target output of this layer is $[v_{i,1}, v_{i,2}, \dots, v_{i,B}, p_{\mathrm{target}}, i \bmod B, \lfloor i/B \rfloor, 1]$, where
$$v_{i,j} = \begin{cases} 0, & j \ne (i \bmod B) + 1, \\ v_i, & j = (i \bmod B) + 1. \end{cases}$$
Note that $v_i$ is bounded in $[-M, M]$. Therefore, letting $a_{ij} := \mathrm{ReLU}\big(j - (i \bmod B) - 1\big) + \mathrm{ReLU}\big((i \bmod B) + 1 - j\big) = |j - (i \bmod B) - 1|$, we can express $v_{i,j}$ as
$$v_{i,j} = \mathrm{ReLU}\big(v_i - 2M a_{ij}\big) - \mathrm{ReLU}\big(-v_i - 2M a_{ij}\big),$$
since $a_{ij} = 0$ when $j = (i \bmod B) + 1$ and $a_{ij} \ge 1$ otherwise. According to Lemma A.2, an MLP with GeLU activation can accomplish this with arbitrarily small error. Therefore, we can write the output of this layer as $x^{(1)}_i = [v_{i,1}, v_{i,2}, \dots, v_{i,B}, p_{\mathrm{target}}, i \bmod B, \lfloor i/B \rfloor, 1]$.

Layer 2. The main task of the second layer is to sum over the embeddings of the previous $B$ tokens. In this layer, the attention head simply pays uniform attention to the previous $B$ tokens. Note that, in the first layer, we distributed the embeddings of different tokens to different coordinates of the embedding. Therefore, the output of the attention layer is
$$\tfrac{1}{B}\,\big[\,v_{i-B+(i \bmod B)+1},\ v_{i-B+(i \bmod B)+2},\ \dots,\ v_i,\ v_{i-B+(i \bmod B)},\ p_{\mathrm{target}},\ i \bmod B,\ \lfloor i/B \rfloor,\ 1\,\big].$$
Then, by using the MLP to multiply by the number of tokens $B$, we recover the embeddings of all tokens of the block. Moreover, in this layer, according to Lemma A.7, we can use the MLP to calculate the index of the block of the target token, $\lceil p_{\mathrm{target}}/B \rceil$. The output of this layer is
$$x^{(2)}_i = \big[\,v_{i-B+(i \bmod B)+1},\ v_{i-B+(i \bmod B)+2},\ \dots,\ v_i,\ v_{i-B+(i \bmod B)},\ p_{\mathrm{target}},\ i \bmod B,\ \lfloor i/B \rfloor,\ \lceil p_{\mathrm{target}}/B \rceil,\ 1\,\big].$$

Layer 3. The main task of the third layer is to COPY the embedding of the token in the corresponding block and select the target embedding. In this layer, the attention head simply copies the embedding of the token in the corresponding block. According to Lemma A.11, the attention layer can complete this task with arbitrarily small error. Finally, we use the MLP to select the target embedding from the copied embedding and obtain the final output.
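The three layers in the proof of Lemma A.12 implement a simple data movement, which the following schematic NumPy walk-through mimics with plain array operations standing in for the attention heads and MLPs (so it is an illustration of the argument, not a Transformer implementation): scatter each $v_i$ into slot $(i \bmod B)+1$, average the previous $B$ tokens and rescale by $B$, then read the register of the last token of the target block and select the slot of $p_{\mathrm{target}}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, B, d = 16, 4, 3                      # sequence length, block size, payload dimension
V = rng.normal(size=(n, d))             # v_i, the embeddings to be copied
p_target = 6                            # 1-indexed target position, known to every token

# Layer 1 (MLP): scatter v_i into slot (i mod B) of a B-slot register (0-based storage).
X1 = np.zeros((n, B, d))
for i in range(1, n + 1):
    X1[i - 1, i % B] = V[i - 1]

# Layer 2 (uniform attention over the previous B tokens, then the MLP rescales by B):
# each token now holds the payloads of its whole block, one per slot.
X2 = np.zeros_like(X1)
for i in range(B, n + 1):               # positions that have a full window of B tokens
    X2[i - 1] = X1[i - B:i].sum(axis=0) # sum == B * mean; the MLP multiplies the mean by B

# Layer 3 (attention copies from the last token of the target block, the MLP selects the slot):
block_end = int(np.ceil(p_target / B)) * B        # last token of the target block
copied = X2[block_end - 1]                        # what the COPY head of Lemma A.11 retrieves
out = copied[p_target % B]                        # slot selection done by the final MLP

assert np.allclose(out, V[p_target - 1])          # the target embedding was recovered
```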
B. Proofs of the theorems in Section 4

For ease of reading, we restate some important assumptions here, which are natural and mild.

Assumption B.1. Each function $f$, $g$, $h$ and $u$ defined in CoT for DP can be approximated by a constant-size MLP with GeLU activation.

Assumption B.2. The state transition function $F$, whose inputs are the token number and the current state and whose output is the next state, can be approximated by an MLP with GeLU activation.

In addition, we replace all residual connections in attention layers and MLPs with concatenation, which does not change the expressive power of the model architecture.

B.1. Proof of Theorem 4.2

In this subsection, we provide a complete proof of Theorem 4.2. For clarity, we restate the theorem for the linear Transformer and the sparse Transformer individually and provide their respective proofs.

Theorem B.3. Consider any DP problem satisfying the same conditions as in Theorem 4.1. Given any integer $n > 0$, let $L$ be the length of the output sequence when the input sequence length is $n$. Then, there is a (log-precision) linear Transformer with a constant depth $M$, a constant number of attention heads $H$, and a hidden dimension $D = O(\sqrt{L})$ that can generate the correct output for all inputs $s$ of length $n$.

Proof. Denote $W = \lceil \sqrt{L} \rceil$. In this formulation, the embeddings are $x^{(0)}_k = (e^{\mathrm{input}}_k, e^{\mathrm{state}}_k, e^{\mathrm{dp}}_k, e^{\mathrm{answer}}_k, k, 1)$, where $e^{\mathrm{input}}_k, e^{\mathrm{state}}_k, e^{\mathrm{dp}}_k, e^{\mathrm{answer}}_k$ correspond to the input token, the DP state, the DP value and the final result, respectively. An embedding is $0$ if the corresponding field is not defined. We construct the layers as follows.

Block 1. According to Assumption B.2, we can use a constant number of MLP layers to obtain $e^{\mathrm{next\,state}}_k$ from $e^{\mathrm{state}}_k$. This can be done by setting zero weight matrices in the attention layers and discarding the attention output in the linear projection of the MLP. The output of this block is $x^{(1)}_k = (e^{\mathrm{input}}_k, e^{\mathrm{state}}_k, e^{\mathrm{next\,state}}_k, e^{\mathrm{dp}}_k, k, 1)$.

Block 2. The second block of the Transformer performs the following tasks: (i) calculate $h(e^{\mathrm{next\,state}}_k)$ and $g(e^{\mathrm{next\,state}}_k)$; (ii) calculate $f^{\mathrm{state}}_k$, the indicator variable of whether $e^{\mathrm{state}}_k$ is the last state; (iii) calculate $(h(e^{\mathrm{next\,state}}_k))^2$, $(g(e^{\mathrm{next\,state}}_k))^2$, $(e^{\mathrm{state}}_k)^2$ (element-wise squares) and $k^2$. The first task can be implemented by several MLP layers by Assumption B.1. To perform the second task, we can check whether $e^{\mathrm{state}}_k = 0$ and $e^{\mathrm{next\,state}}_k = 0$. The last task can be done with an MLP using Lemma A.1. The output of this block is
$$x^{(2)}_k = \big(e^{\mathrm{input}}_k, e^{\mathrm{state}}_k, e^{\mathrm{next\,state}}_k, e^{\mathrm{dp}}_k, e^{\mathrm{sep}}_k, h(e^{\mathrm{next\,state}}_k), g(e^{\mathrm{next\,state}}_k), (h(e^{\mathrm{next\,state}}_k))^2, (g(e^{\mathrm{next\,state}}_k))^2, (e^{\mathrm{state}}_k)^2, f^{\mathrm{state}}_k, k, k^2, 1\big).$$

Block 3. The third block of the Transformer uses $K + J$ heads to perform the following tasks, where $K$ and $J$ are defined in Equation (9), which describes the DP transition: (i) get the inputs $s_{g_1(i)}, \dots, s_{g_J(i)}$, where $i$ corresponds to $e^{\mathrm{next\,state}}_k$; (ii) get the DP values $\mathrm{dp}(h_1(i)), \dots, \mathrm{dp}(h_K(i))$, where $i$ corresponds to $e^{\mathrm{next\,state}}_k$; (iii) calculate $e^{\mathrm{next\,dp}}_k$, i.e., the DP value of the next state $e^{\mathrm{next\,state}}_k$.

To perform the first two tasks, we first need to calculate the absolute positions of the embeddings we want. This can be implemented by several MLPs, using Assumption B.1 and the problem size $n$. Suppose the absolute positions are $\hat{g}_1, \dots, \hat{g}_J, \hat{h}_1, \dots, \hat{h}_K$. Then we do the following for every $t \in \{\hat{g}_1, \dots, \hat{g}_J, \hat{h}_1, \dots, \hat{h}_K\}$:

1. Calculate $\lceil t/W \rceil$ and $W - (\lceil t/W \rceil W - t)$ by Lemma A.7.

2. Calculate $e_s \in \mathbb{R}^W$ for $s = \lceil t/W \rceil$ and $s = W - (\lceil t/W \rceil W - t)$ by Lemmas A.5 and A.6 and the fact that $e_i = m_i \odot n_i$ (element-wise product).

We also calculate $e_s \in \mathbb{R}^W$ for $s = \lceil k/W \rceil$ and $s = W - (\lceil k/W \rceil W - k)$ by Lemmas A.5 and A.6. Then we can get
$$\big( \mathbb{I}[W - (\lceil k/W \rceil W - k) = 1]\, x^{(2)}_k,\ \dots,\ \mathbb{I}[W - (\lceil k/W \rceil W - k) = W]\, x^{(2)}_k \big)$$
by multiplying $e_{W - (\lceil k/W \rceil W - k)}$ with $x^{(2)}_k$, which can be implemented by an MLP with hidden dimension $O(W)$.
Then, we can obtain $x^{(2)}_t$ for $t = \hat{g}_1, \dots, \hat{g}_J, \hat{h}_1, \dots, \hat{h}_K$ in the following steps:

1. Using Lemma A.9, we obtain, at position $i$,
$$\begin{cases} \frac{1}{W}\big( x^{(2)}_{\lceil t/W \rceil W - (W-1)}, \dots, x^{(2)}_{\lceil t/W \rceil W} \big), & i \ge \lceil t/W \rceil W, \\[4pt] \frac{1}{\,i - \lceil t/W \rceil W + W\,}\big( x^{(2)}_{\lceil t/W \rceil W - (W-1)}, \dots, x^{(2)}_{i}, 0, \dots, 0 \big), & i < \lceil t/W \rceil W, \end{cases}$$
for $t = \hat{g}_1, \dots, \hat{g}_J, \hat{h}_1, \dots, \hat{h}_K$.

2. Using Lemma A.1 for multiplication first, and then Lemma A.4 to implement a conditional selection, we obtain
$$\begin{cases} \big( x^{(2)}_{\lceil t/W \rceil W - (W-1)}, \dots, x^{(2)}_{\lceil t/W \rceil W} \big), & i \ge \lceil t/W \rceil W, \\[4pt] \big( x^{(2)}_{\lceil t/W \rceil W - (W-1)}, \dots, x^{(2)}_{i}, 0, \dots, 0 \big), & i < \lceil t/W \rceil W, \end{cases}$$
for $t = \hat{g}_1, \dots, \hat{g}_J, \hat{h}_1, \dots, \hat{h}_K$.

3. Get $x^{(2)}_t$ for $t = \hat{g}_1, \dots, \hat{g}_J, \hat{h}_1, \dots, \hat{h}_K$ by multiplying the above vector with $e_{W - (\lceil t/W \rceil W - t)}$ and summing up the result.

Then we can get $s_{g^{(n)}_1(i)}, \dots, s_{g^{(n)}_J(i)}$ and $\mathrm{dp}(h^{(n)}_1(i)), \dots, \mathrm{dp}(h^{(n)}_K(i))$ by a linear projection. If the corresponding field is undefined, we mark it with a special value; this can be done by selection using an MLP (Lemma A.4). Assumption B.1 indicates that we can then calculate the function $f$ (defined in (9)) using an MLP. The output of this layer is $x^{(3)}_k = (e^{\mathrm{next\,state}}_k, e^{\mathrm{dp}}_k, e^{\mathrm{next\,dp}}_k, n_k, f^{\mathrm{state}}_k, k, 1)$.

Block 4. The fourth block of the autoregressive Transformer generates the output based on the flag $f^{\mathrm{state}}_k$. We calculate $u(e^{\mathrm{dp}}_k)$ to obtain the final result. If $f^{\mathrm{state}}_k = 1$, then we select $u(e^{\mathrm{dp}}_k)$ as the output; otherwise, we prepare and output the DP result for the next state, i.e., $e^{\mathrm{next\,state}}_k$ and $e^{\mathrm{next\,dp}}_k$. This is a conditional selection operation and can thus be implemented by an MLP (Lemma A.4).

Theorem B.4. Consider any DP problem satisfying the same conditions as in Theorem 4.1. Given any integer $n > 0$, let $L$ be the length of the output sequence when the input sequence length is $n$. Then, there is a (log-precision) sparse Transformer with block size $B = \Theta(\sqrt{L})$, a constant depth $M$, a constant number of attention heads $H$, and a hidden dimension $D = O(\sqrt{L})$ that can generate the correct output for all inputs $s$ of length $n$.

Proof. The construction of the first two blocks is the same as for the linear Transformer; we give the construction of the third and fourth blocks for the sparse Transformer.

Block 3. The third block of the Transformer uses $K + J$ heads to perform the following tasks, where $K$ and $J$ are defined in Equation (9), which describes the DP transition: (i) get the inputs $s_{g_1(i)}, \dots, s_{g_J(i)}$, where $i$ corresponds to $e^{\mathrm{next\,state}}_k$; (ii) get the DP values $\mathrm{dp}(h_1(i)), \dots, \mathrm{dp}(h_K(i))$, where $i$ corresponds to $e^{\mathrm{next\,state}}_k$; (iii) calculate $e^{\mathrm{next\,dp}}_k$, i.e., the DP value of the next state $e^{\mathrm{next\,state}}_k$.

To perform the first two tasks, we first need to calculate the absolute positions of the embeddings we want. This can be implemented by several MLPs, using Assumption B.1 and the problem size $n$. Supposing the absolute positions are $\hat{g}_1, \dots, \hat{g}_J, \hat{h}_1, \dots, \hat{h}_K$, we can perform the COPY operation of Lemma A.12 to obtain the inputs $s_{g_1(i)}, \dots, s_{g_J(i)}$ and the DP values $\mathrm{dp}(h_1(i)), \dots, \mathrm{dp}(h_K(i))$ for $i$ corresponding to $e^{\mathrm{next\,state}}_k$. With these inputs and DP values, according to Assumption B.1, we can calculate the function $f$ (defined in (9)) using an MLP. The output of this layer is $x^{(3)}_k = (e^{\mathrm{next\,state}}_k, e^{\mathrm{dp}}_k, e^{\mathrm{next\,dp}}_k, n_k, f^{\mathrm{state}}_k, k, 1)$.

Block 4. The fourth block of the sparse Transformer generates the output based on the flag $f^{\mathrm{state}}_k$. We calculate $u(e^{\mathrm{dp}}_k)$ to obtain the final result. If $f^{\mathrm{state}}_k = 1$, then we select $u(e^{\mathrm{dp}}_k)$ as the output; otherwise, we prepare and output the DP result for the next state, i.e., $e^{\mathrm{next\,state}}_k$ and $e^{\mathrm{next\,dp}}_k$. This is a conditional selection operation and can thus be implemented by an MLP (Lemma A.4).
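For intuition about the cost side of Theorem B.4 (and the counting used in Theorem B.6 below), the following sketch builds a causal block-sparse mask with block size $B = \Theta(\sqrt{L})$ and counts the attended positions. The exact pattern of Child et al. (2019) has several variants; the one below (a local window of $B$ tokens plus the last token of every earlier block) is an assumption chosen for illustration.

```python
import numpy as np

def sparse_attention_mask(L, B):
    """Causal block-sparse mask (one common strided variant): token i attends to
    the previous B tokens and to the last token of every earlier block."""
    mask = np.zeros((L, L), dtype=bool)
    for i in range(L):
        lo = max(0, i - B + 1)
        mask[i, lo:i + 1] = True                 # local window of size B
        mask[i, B - 1:i + 1:B] = True            # last position of each earlier block
    return mask

L = 64
B = int(np.ceil(np.sqrt(L)))                      # block size B = Theta(sqrt(L))
mask = sparse_attention_mask(L, B)
per_token = mask.sum(axis=1)
print(per_token.max(), mask.sum())                # O(sqrt(L)) per token, O(L*sqrt(L)) overall
```

Each position attends to $O(\sqrt{L})$ others, so generating $L$ tokens touches $O(L\sqrt{L})$ query-key pairs, in line with the Sparse Transformer entry of Table 1.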
B.2. Proof of Theorem 4.4

In this subsection, we provide a complete proof of Theorem 4.4. For clarity, we restate the theorem for the linear Transformer and the sparse Transformer individually and provide their respective full proofs.

Theorem B.5. Consider any regular DP problem satisfying the same conditions as in Theorem 4.1. Assume that the output sequence length $L$ is proportional to the input sequence length $n$, i.e., $L = \Theta(n)$. Then, for a sufficiently large $n$, a (log-precision) linear Transformer with a constant depth $M$ and a constant number of attention heads $H$ can generate the correct output for all inputs $s$ of length $n$ only if the hidden dimension $D = \widetilde{\Omega}(\sqrt{L})$.

Proof. By Lemma A.8, the hidden states of the Linear Transformer layers at the last input token (which is a fixed special token) determine the CoT output. By the regularity assumption, different inputs must correspond to different output sequences and hence to different hidden states, so the hidden states must encode at least $\Omega(n) = \Omega(L)$ bits. Since the hidden states consist of $O(D^2)$ log-precision numbers, i.e., $O(D^2 \log L)$ bits, the hidden dimension must be at least $\Omega(\sqrt{L/\log L}) = \widetilde{\Omega}(\sqrt{L})$, concluding our proof.

Theorem B.6. Consider any regular DP problem satisfying the same conditions as in Theorem 4.1. Assume that the output sequence length $L$ is proportional to the input sequence length $n$, i.e., $L = \Theta(n)$. Then, for a sufficiently large $n$, a (log-precision) sparse Transformer with block size $B = \Theta(\sqrt{L})$, a constant depth $M$ and a constant number of attention heads $H$ can generate the correct output for all inputs $s$ of length $n$ only if the hidden dimension $D = \widetilde{\Omega}(\sqrt{L})$.

Proof. Assuming $D = o(\sqrt{L}/\ln L)$, we present a proof by contradiction. When the model generates the output sequence, it attends to at most $\Theta(B)$ tokens of the input sequence, namely the last $c$ tokens of every block and the last $B$ tokens of the input sequence. The overall memory cost of all the hidden embeddings of these tokens is $\Theta(BD \log n)$ bits. According to Equation (6), given the same hidden embeddings of these tokens in every layer, the model executes the same computation and produces an identical output sequence. Thus, with hidden dimension $D = o(\sqrt{L}/\ln L)$, the model can produce at most $e^{\Theta(BD \log n)} = e^{o(L)}$ distinct output sequences. However, there are $e^{\Theta(L)}$ distinct input sequences with corresponding output sequences. By the pigeonhole principle, there exist two distinct input sequences for which the model generates the same output sequence; by the regularity assumption, the model therefore generates an incorrect output sequence for at least one input. Hence $D = \Omega(\sqrt{L}/\ln L) = \widetilde{\Omega}(\sqrt{L})$.

C. Proofs of the theorems in Section 5

In this section, we prove the theorems and propositions outlined in Section 5. We first give a precise and formal definition of the problems and statements for each theorem and proposition, and then provide a thorough proof of each.

This paper uses the same formulation of the arithmetic evaluation task and its CoT solution as Feng et al. (2023). The arithmetic evaluation task is denoted Arithmetic(n, p) and is defined over the finite field modulo $p$, with input length not exceeding $n$. The CoT solution can be formally defined as follows: in any arithmetic expression, there must exist a handle, i.e., a pair of adjacent numbers connected by an operator that can be evaluated. In the CoT method, we evaluate the leftmost handle in each step and connect the equations of successive steps with equal signs.
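As an illustration of this CoT format, the following Python sketch repeatedly evaluates a handle modulo $p$ and joins the partially evaluated expressions with equal signs. The handle selection below (first multiplication if any, otherwise the first operator) is a simplification of the leftmost-handle rule, and parentheses and modular division are omitted, so this only illustrates the shape of the generated sequence.

```python
# A minimal sketch of the CoT format for Arithmetic(n, p): repeatedly evaluate a
# handle (an operator applied to two adjacent numbers) modulo p and emit the
# partially evaluated expression after each step.
def arithmetic_cot(tokens, p):
    ops = {"+": lambda a, b: (a + b) % p, "-": lambda a, b: (a - b) % p,
           "*": lambda a, b: (a * b) % p}
    steps = []
    while len(tokens) > 1:
        # simplified handle selection: first "*" if any (higher precedence),
        # otherwise the first operator; the formal definition picks the leftmost evaluable pair
        idx = next((i for i, t in enumerate(tokens) if t == "*"),
                   next(i for i, t in enumerate(tokens) if t in ops))
        val = ops[tokens[idx]](tokens[idx - 1], tokens[idx + 1])
        tokens = tokens[:idx - 1] + [val] + tokens[idx + 2:]
        steps.append(" ".join(map(str, tokens)))
    return " = ".join(steps)

print(arithmetic_cot([2, "+", 3, "*", 4], 7))   # "2 + 5 = 0": 3*4 = 5 (mod 7), then 2+5 = 0 (mod 7)
```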
C.1. Proof of Proposition 5.2

In this section, we give a proof of Proposition 5.2.

Theorem C.1. For any integer $n$, a log-precision linear Transformer with a constant depth $M$ and a constant number of heads $H$ can generate the correct output of the arithmetic evaluation task for all expressions of length at most $n$ only if the hidden dimension $D = \widetilde{\Omega}(\sqrt[4]{L})$.

Proof. Notice that the generated CoT sequence determines the result of the first step. The number of different possible results of the first step is at least $\exp(\Omega(n))$, since every expression $a_1\, \mathrm{OP}_1\, a_2\, \mathrm{OP}_2 \cdots$ with $a_i \in \mathbb{F}_p$ and $\mathrm{OP}_i \in \{+, -, \times, \div\}$ is legal. On the other hand, the hidden state corresponding to the input sequence determines the output CoT sequence. Thus, the hidden state must encode at least $\Omega(n) = \Omega(\sqrt{L})$ bits (recall that $L = \Theta(n^2)$ for this task). This means that the hidden dimension must be at least $\Omega\big(\sqrt{\sqrt{L}/\log L}\big) = \widetilde{\Omega}(\sqrt[4]{L})$, which concludes our proof.

C.2. Proof of Proposition 5.4

In this subsection, we prove Proposition 5.4, which is a corollary of the following theorem from Feng et al. (2023) and shows that the sparse Transformer can generate the correct output for $m$-locality DP problems.

Theorem C.2 (From Feng et al. (2023)). For any DP problem and any integer $n \in \mathbb{N}$, there exists an autoregressive Transformer with constant depth $L$, hidden dimension $d$ and number of attention heads $H$ (independent of $n$), such that the answer generated by the Transformer is correct for all input sequences $s$ of length at most $n$. Moreover, all parameter values are bounded by $O(\mathrm{poly}(n))$.

Theorem C.3. Consider any $m$-locality DP problem satisfying the same conditions as in Theorem 4.1. Given any integer $n > 0$, let $L$ be the length of the output sequence when the input sequence length is $n$. Then, there exists a (log-precision) sparse Transformer with block size $B = \Theta(m)$, a constant depth $M$, a constant number of attention heads $H$, and a constant hidden dimension $D$ that can generate the correct output for all inputs $s$ of length $n$.

Proof. Under the locality assumption, we can treat the sparse Transformer as a standard Transformer when solving the $m$-locality DP problem. When the standard Transformer solves an $m$-locality DP problem, each attention head only attends to tokens within distance at most $m$, and the attention weight on all other tokens is $0$. The sparse Transformer adds masks on those other tokens and is therefore equivalent to the standard Transformer on $m$-locality DP problems. According to Theorem C.2, the sparse Transformer can thus solve the $m$-locality DP problem.

C.3. Proof of Proposition 5.5

In this section, we give a proof of Proposition 5.5, which is a corollary of Theorem 4.4.

Theorem C.4. Consider any $m$-locality regular DP problem satisfying the same conditions as in Theorem 4.1, and assume that $m = \Theta(n)$, where $n$ is the input sequence length. Then, a log-precision linear Transformer with a constant depth $M$ and a constant number of heads $H$ can generate the correct output for all inputs $s$ of length $n$ only if the hidden dimension $D = \widetilde{\Omega}(\sqrt{m})$.

Proof. By the same argument as in the proof of Theorem 4.4, the hidden states must encode at least $\Omega(n) = \Omega(m)$ bits. The regularity assumption then implies that the hidden dimension must be at least $\Omega(\sqrt{m/\log m}) = \widetilde{\Omega}(\sqrt{m})$, concluding our proof.
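The counting arguments of Theorems B.5, C.1 and C.4 share a back-of-the-envelope shape that can be made concrete: a linear-attention layer carries $O(D^2)$ log-precision numbers across the whole prefix, while the task forces it to distinguish exponentially many prefixes. The sketch below instantiates this for the arithmetic task; the constants (the factor $4p$, the choice of precision) are illustrative assumptions rather than the formal statement.

```python
import math

def min_hidden_dim_linear(n, p, precision_bits):
    """Rough illustration of the counting argument behind Theorem C.1 (not the formal
    statement): the recurrent state of a linear-attention layer holds O(D^2) numbers of
    `precision_bits` bits each, yet it must distinguish roughly (4p)^Theta(n) possible
    first CoT steps, i.e. encode on the order of n*log(p) bits."""
    needed_bits = n * math.log2(4 * p)          # choices: a_i in F_p and one of 4 operators
    return math.ceil(math.sqrt(needed_bits / precision_bits))   # D^2 * precision >= needed

for n in (64, 256, 1024, 4096):
    # log-precision means precision_bits = Theta(log L) = Theta(log n) for this task
    print(n, min_hidden_dim_linear(n, p=97, precision_bits=math.ceil(math.log2(n**2))))
```

The implied lower bound on $D$ grows roughly like $\sqrt{n/\log n} = \widetilde{\Theta}(\sqrt[4]{L})$, matching the scaling in Theorem C.1.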
D. Additional Experiments

The accuracies of the standard and efficient Transformers with 5 layers on the ED task are shown in Figure 3. The results are similar to those obtained with 3 layers. This evidence further supports our theoretical analysis.

Figure 3. A comparison of the accuracies of different model types on the ED task. Each subplot corresponds to a model (Standard Transformer, Linear Transformer, Sparse Transformer). Within each subplot, the x-axis represents the embedding dimension, the y-axis denotes the problem size, and the color intensity indicates the accuracy achieved by the respective model.