# Fast Attention Requires Bounded Entries

Josh Alman (josh@cs.columbia.edu, Columbia University) and Zhao Song (zsong@adobe.com, Adobe Research)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract. In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3, and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B, B]^{n \times d}$, and the goal is to construct the matrix $\mathrm{Att}(Q, K, V) := \mathrm{diag}(A \mathbf{1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the "attention matrix" and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $\Omega(n^2)$ even when $d = n^{o(1)}$ is small. In this paper, we investigate whether faster algorithms are possible by implicitly making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = \Theta(\sqrt{\log n})$.

- If $d = O(\log n)$ and $B = o(\sqrt{\log n})$, there is an $n^{1+o(1)}$ time algorithm to approximate $\mathrm{Att}(Q, K, V)$ up to $1/\mathrm{poly}(n)$ additive error.
- If $d = O(\log n)$ and $B = \Theta(\sqrt{\log n})$, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate $\mathrm{Att}(Q, K, V)$ up to $1/\mathrm{poly}(n)$ additive error in truly subquadratic time $n^{2-\Omega(1)}$.

This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.

1 Introduction

Large language models (LLMs) such as Transformer [VSP+17], BERT [DCLT18], GPT-3 [BMR+20], PaLM [CND+22], and OPT [ZRG+22] can process natural language more effectively than smaller models or traditional algorithms. This means that they can understand and generate more complex and nuanced language, which can be useful for a variety of tasks such as language translation, question answering, and sentiment analysis. LLMs can also be adapted to multiple purposes without needing to be retrained from scratch. Their power is particularly exemplified by the recent success of ChatGPT, a chat software by OpenAI built on top of GPT-3 [Ope22].

The key technical backbone of LLMs is the attention matrix [VSP+17, RNS+18, DCLT18, RWC+19, BMR+20]. An attention matrix is a square matrix whose rows and columns correspond to words or "tokens", and whose entries correspond to the correlations between these tokens in natural text. The attention matrix is then used to calculate the importance of each input token in a sequence when producing an output. In an attention mechanism, each input token is given a weight or score, which reflects its importance or relevance to the current output being generated. These scores are calculated based on a comparison between the current output state and the input states, using a similarity function.

More formally, the attention matrix is defined as follows. Let $Q \in \mathbb{R}^{n \times d}$ be the matrix of query tokens, and $K \in \mathbb{R}^{n \times d}$ be the matrix of key tokens. (We focus here on the case when $d = n^{o(1)}$, so $d \ll n$.) The attention matrix is an $n \times n$ matrix $A$ whose rows and columns correspond to the input tokens in the sequence. Each entry in the matrix represents the attention weight or score between a particular input token (query token from $Q$) and a particular output token (key token from $K$). The diagonal entries of the matrix represent self-attention scores, which measure the importance of each token with respect to itself.
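As a concrete point of reference, the quadratic-time computation described above (and formalized in Definition 1.1 below) can be written in a few lines. The following is a minimal NumPy sketch; the sizes, seed, and bound $B$ are arbitrary placeholders for illustration only.

```python
import numpy as np

def exact_attention(Q, K, V):
    """Straightforward O(n^2 d) attention: Att(Q, K, V) = D^{-1} A V with
    A = exp(Q K^T / d) applied entry-wise and D = diag(A 1_n)."""
    n, d = Q.shape
    A = np.exp(Q @ K.T / d)            # n x n attention matrix, computed explicitly
    D_inv = 1.0 / A.sum(axis=1)        # diagonal of D^{-1}, where D = diag(A 1_n)
    return D_inv[:, None] * (A @ V)    # D^{-1} A V, an n x d matrix

# tiny example with bounded entries, |Q|, |K|, |V| <= B
rng = np.random.default_rng(0)
n, d, B = 6, 3, 1.0
Q, K, V = (rng.uniform(-B, B, size=(n, d)) for _ in range(3))
print(exact_attention(Q, K, V).shape)  # (6, 3)
```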
The major bottleneck to speeding up LLM operations (in the case of modeling long sequences with large $n$) is the time to perform attention matrix computations [VSP+17, RNS+18, DCLT18, RWC+19, BMR+20, WLK+20, KKL20]. These computations ask us to multiply the attention matrix $A$ with another value token matrix $V \in \mathbb{R}^{n \times d}$. We formally define Attention computation as follows. Throughout this paper, we write $\exp$ to denote the entry-wise exponential for matrices.

Definition 1.1 (Exact Attention Computation $\mathsf{EAttC}(n, d)$). Given three matrices $Q, K, V \in \mathbb{R}^{n \times d}$, output the $n \times d$ matrix $\mathrm{Att}(Q, K, V) := D^{-1} A V$, where $A \in \mathbb{R}^{n \times n}$ and the diagonal matrix $D \in \mathbb{R}^{n \times n}$ are defined as $A := \exp(QK^\top/d)$ and $D := \mathrm{diag}(A \mathbf{1}_n)$.

The straightforward algorithm for this problem computes the matrix $A$ and then performs the multiplications $D^{-1}AV$, in time $n^{2+o(1)}$. Since $A$ is an $n \times n$ matrix with $n^2$ entries, it is impossible to improve on this much while explicitly computing the matrix $A$. However, the input to the problem is not $A$, but rather the three matrices $Q, K, V$, which each have only $n^{1+o(1)}$ entries. An algorithm which only implicitly makes use of $A$, without explicitly computing all its entries, could hope to run in almost linear time!

In this paper, we investigate the possibility of accelerating attention computations in this way. The two main questions we address are:

Q1. When can we perform attention computations in almost linear time $n^{1+o(1)}$?

Q2. When can we prove that subquadratic-time algorithms for attention computations are impossible?

In most LLMs, it suffices to approximately perform attention computations throughout the inference process as long as there are reasonable precision guarantees [CGRS19, KKL20, WLK+20, DKOD20, KVPF20, CDW+21, CDL+22, LWD+23, ZSZ+23]. We therefore focus here on approximate attention computation, which can potentially be performed even faster than exact computation. Mathematically, we define the approximate version of $\mathsf{EAttC}$ as follows.

Definition 1.2 (Approximate Attention Computation $\mathsf{AAttC}(n, d, B, \epsilon_a)$). Let $\epsilon_a > 0$ and $B > 0$ be parameters. Given three matrices $Q, K, V \in \mathbb{R}^{n \times d}$ with the guarantees that $\|Q\|_\infty \le B$, $\|K\|_\infty \le B$, and $\|V\|_\infty \le B$, output a matrix $T \in \mathbb{R}^{n \times d}$ which is approximately equal to $D^{-1}AV$, meaning $\|T - D^{-1}AV\|_\infty \le \epsilon_a$. Here, for a matrix $M \in \mathbb{R}^{n \times n}$, we write $\|M\|_\infty := \max_{i,j} |M_{i,j}|$.

Again, the straightforward algorithm for this problem runs in time $O(n^2 d) \le n^{2+o(1)}$, but the input size is only $O(nd) \le n^{1+o(1)}$. Our goal is to investigate when faster algorithms are possible in terms of the parameters $d$, $B$, and $\epsilon_a$.
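To make the approximation guarantee of Definition 1.2 concrete, the following small helper (a sketch; the function name and sizes are ours) measures the $\ell_\infty$ error that $\epsilon_a$ is required to bound, by comparing a candidate output $T$ against the exact quadratic-time computation.

```python
import numpy as np

def linf_attention_error(T, Q, K, V):
    """Return max_{i,j} |T_{i,j} - (D^{-1} A V)_{i,j}|, the quantity that
    Definition 1.2 requires to be at most eps_a."""
    A = np.exp(Q @ K.T / Q.shape[1])                  # A = exp(Q K^T / d)
    exact = (A @ V) / A.sum(axis=1, keepdims=True)    # D^{-1} A V
    return np.max(np.abs(T - exact))
```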
1.1 Our Results

We focus on the natural setting where $d = O(\log n)$ (the setting where we model long sequences) and $\epsilon_a = 1/\mathrm{poly}(n)$ (low enough error so that attention computations over an entire network can be combined). Our main results show that whether or not there is a fast algorithm for $\mathsf{AAttC}$ critically depends on $B$, the magnitude of the entries in the input matrices.

We first show a lower bound: when $B = \Omega(\sqrt{\log n})$, it is impossible to design a truly subquadratic-time algorithm. Our lower bound makes use of the Strong Exponential Time Hypothesis (SETH) [IP01], a popular conjecture [Wil18] from the area of fine-grained complexity regarding the time required to solve $k$-SAT. (See Section 4 below, where we discuss SETH in more detail.)

Theorem 1.3 (Lower bound, informal version of Theorem 4.6). Assuming SETH, for every $q > 0$, there are constants $C, C_a, C_b > 0$ such that there is no $O(n^{2-q})$ time algorithm for the problem $\mathsf{AAttC}(n, d = C \log n, B = C_b \sqrt{\log n}, \epsilon_a = n^{-C_a})$.

Our second, complementary result is a new algorithm, showing that when $B = o(\sqrt{\log n})$, the problem can be solved very efficiently, in almost linear time.

Theorem 1.4 (Upper bound, informal version of Theorem 3.8). There is an algorithm (Algorithm 1) that solves $\mathsf{AAttC}(n, d = O(\log n), B = o(\sqrt{\log n}), \epsilon_a = 1/\mathrm{poly}(n))$ in time $n^{1+o(1)}$.

Our Theorems 1.3 and 1.4 show that the attention computation problem $\mathsf{AAttC}$ exhibits a very tight transition at $B = \Theta(\sqrt{\log n})$ from almost linear time to trivial quadratic time. When $B = o(\sqrt{\log n})$ is smaller, the problem can be solved in almost linear time $n^{1+o(1)}$ in the input size, using our algorithm for Theorem 1.4. When $B = \Omega(\sqrt{\log n})$ is greater, our algorithm from Theorem 1.4 no longer applies, and furthermore our lower bound from Theorem 1.3 shows that it is impossible to solve the problem in truly subquadratic time, no matter what algorithmic techniques one uses (assuming SETH).

It has been observed in LLM implementations in practice that computations are much faster when one assumes that the matrix entries are bounded or can be well-approximated using a small number of bits (see, e.g., [ZBIW19, Section 2] and [KVPF20, Section 3.2.1]). Our work can be viewed as giving a theoretical explanation for this phenomenon, and as helping to explain why techniques like quantization [ZBIW19] and low-degree polynomial approximation [KVPF20] have been so effective in practice.

Related Work. A recent work by Zandieh, Han, Daliri, and Karbasi [ZHDK23] was the first to give an algorithm with provable guarantees for attention approximation. Their algorithm makes use of locality sensitive hashing (LSH) techniques [CKNS20] which, as we will discuss next, are quite different from our algorithm for Theorem 1.4, which uses the polynomial method [ACSS20, AA22]. In the case when $d = o(\log^2 n)$, they achieve a running time of roughly $O(n^{1.17} d / \epsilon_r^2)$, where $\epsilon_r$ is a relative error parameter (which is similar to, though not exactly the same as, our $\epsilon_a$ from Definition 1.2). In particular, their algorithm applies for larger $d$ than ours (we require $d = O(\log n)$), but we achieve almost linear time $n^{1+o(1)}$ (whereas their running time is bounded below by $\Omega(n^{1.17})$), and our algorithm can handle any polynomial error $\epsilon_a = 1/\mathrm{poly}(n)$ (whereas they require $\epsilon_r \ge 1/n^{o(1)}$ to not increase the running time by a polynomial factor). It is natural to wonder whether further improvements are possible by combining our techniques with those of [ZHDK23]. However, our lower bound of Theorem 1.3 shows that our algorithm of Theorem 1.4 is already essentially tight and cannot be substantially improved.

Another recent work, by Keles, Wijewardena, and Hegde [KWH23], was the first to prove a lower bound for attention computation assuming SETH. They prove, among other results, that $\mathsf{AAttC}$ cannot be solved in truly subquadratic time in the case when $d = \omega(\log n)$. Our Theorem 1.3 improves their result to also hold for $d = \Theta(\log n)$, and to show how the complexity changes with the magnitude of entries $B$ (which is not studied by [KWH23]). As we discuss shortly, both our lower bound proof and [KWH23] use the high-level technique of [BIS17], although our more fine-grained analysis of the parameters $d, B$ requires a more intricate argument and the use of other techniques from fine-grained complexity related to approximate nearest neighbor search [Rub18] and the polynomial method [AA22].
1.2 Technique Overview

Our high-level approach is to make use of similarities between attention computation and other computational problems related to Kernel Density Estimation (KDE). Such a relationship was investigated by recent work [TBY+19, ZHDK23]. In particular, [ZHDK23] was inspired to apply LSH techniques to attention computation because of the prevalence of LSH in KDE algorithms [CS17, BCIS18, CS19, CKNS20]. The main conceptual idea behind our results is that different techniques from the KDE literature, other than LSH, can be modified to apply in this setting and yield tight algorithms and lower bounds.

To design our algorithm for Theorem 1.4, we instead build off of a different line of work on KDE which makes use of the "polynomial method" in algorithm design. Suppose $M \in \mathbb{R}^{n \times n}$ is a matrix, $f : \mathbb{R} \to \mathbb{R}$ is a function, and let $f(M)$ denote the matrix one gets by applying $f$ entry-wise to $M$. The polynomial method is a technique for finding low-rank approximations of $f(M)$. It shows that if $M$ has low rank, and if $f$ can be approximated by a low-degree polynomial, then the matrix $f(M)$ is very close to a low-rank matrix whose low-rank decomposition can be computed efficiently. To use this to solve $\mathsf{AAttC}$, we make use of a recent result which bounds the degree required to approximate the exponential function by a polynomial [AA22] in order to find a low-rank approximation of the attention matrix $A$. Prior work [ACSS20, ACM+20, AA22] applied these polynomials in a similar way to solve the Gaussian KDE problem; our main observation is that, by an appropriate rescaling, this approach can be modified to apply to $\mathsf{AAttC}$ as well.

The proof of our lower bound, Theorem 1.3, builds off of another line of work on the fine-grained complexity of KDE problems [BIS17, ACSS20, AA22]. The main idea is to give a fine-grained reduction from the well-studied problem of Approximate Nearest Neighbor search (ANN). In ANN, one is given as input $n$ vectors of dimension $d$ and an error parameter $\epsilon > 0$, and the goal is to find a pair of vectors whose distance is at most $(1 + \epsilon)$ times the minimum distance between any pair of the vectors. The straightforward algorithm for ANN runs in quadratic time, and it is known that it is impossible to solve ANN in truly subquadratic time assuming SETH [Rub18]. In order to prove our lower bound, we show that $\mathsf{AAttC}$ can be used to solve ANN. The key idea is that, if the matrices $Q$ and $K$ from $\mathsf{AAttC}$ are formed by concatenating the input vectors to the ANN problem, then the nearest neighbor vectors correspond to the largest entries of the attention matrix $A$. It is not immediately clear that $\mathsf{AAttC}$ can be used to detect large entries of $A$, since the output is rescaled by the matrix $D^{-1}$, but we show that this can be overcome with some modifications to the input vectors which approximately balance the rows of $A$. Prior work [BIS17, ACSS20, AA22] used a very similar approach to give lower bounds for KDE problems, although KDE doesn't involve any rescaling factors.

In Section 2, we introduce relevant notation and tools from prior work. In Section 3, we present and analyze our attention algorithm. In Section 4, we prove our fine-grained attention lower bound. In Section 5, we provide a conclusion for this paper.

2 Preliminaries

We work in the standard real-RAM model and assume arithmetic operations on real numbers can be performed in constant time in our algorithms. We use $\mathcal{T}_{\mathrm{mat}}(a, b, c)$ to denote the time to multiply an $a \times b$ matrix with another $b \times c$ matrix.
In fact, we will only make use of the straightforward, practical bound $\mathcal{T}_{\mathrm{mat}}(a, b, c) = O(abc)$. In principle, fast theoretical matrix multiplication algorithms could be used instead to improve this bound and speed up our algorithms here (in exchange for making them less practical). That said, because of our parameter settings (we will only make use of $\mathcal{T}_{\mathrm{mat}}(n, n^{o(1)}, n^{o(1)})$, which can be solved straightforwardly in time $n^{1+o(1)}$, and which cannot be solved much faster since it has input size $n^{1+o(1)}$), we will see that faster matrix multiplication could only improve low-order terms in our running times.

For any positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For a matrix $M$, we write $\|M\|_\infty$ to denote its $\ell_\infty$ norm, i.e., $\|M\|_\infty := \max_{i,j} |M_{i,j}|$, and we use $M^\top$ to denote its transpose. We use $\mathbf{1}_n$ to denote the length-$n$ vector whose entries are all ones, and $\mathbf{0}_n$ to denote the length-$n$ vector whose entries are all zeros.

For any matrix $A \in \mathbb{R}^{n \times n}$, we use $\exp(A) \in \mathbb{R}^{n \times n}$ to denote the matrix with $\exp(A)_{i,j} = \exp(A_{i,j})$. In other words, all the $\exp(\cdot)$ operators in this paper are applied entry-wise to matrices. In particular, we will not use matrix exponentials in this paper.

For a vector $x \in \mathbb{R}^n$, we use $\|x\|_0$ to denote its number of non-zero entries, $\|x\|_1$ to denote its $\ell_1$ norm, i.e., $\|x\|_1 := \sum_{i=1}^n |x_i|$, and $\|x\|_2$ to denote its $\ell_2$ norm, i.e., $\|x\|_2 := (\sum_{i=1}^n |x_i|^2)^{1/2}$. For a vector $x$, we use $x^\top$ to denote its transpose.

2.1 Additive Error for Polynomial Approximation

Our algorithm for attention computation will critically make use of a polynomial approximation for the exponential function. In particular, we use the following tight construction from previous work [AA22].

Lemma 2.1 ([AA22]). Let $B > 1$ and let $\epsilon \in (0, 0.1)$. There is a polynomial $P : \mathbb{R} \to \mathbb{R}$ of degree $g := \Theta\big(\max\big\{ \frac{\log(1/\epsilon)}{\log(\log(1/\epsilon)/B)}, B \big\}\big)$ such that for all $x \in [0, B]$, we have $|P(x) - \exp(x)| < \epsilon$. Moreover, $P$ can be computed efficiently: its coefficients are rational numbers with $\mathrm{poly}(g)$-bit integer numerators and denominators which can be computed in $\mathrm{poly}(g)$ time.

2.2 From Additive Error to Relative Error

We note that in our setting, Lemma 2.1 can be used to give a relative error approximation as well:

Corollary 2.2. Let $B > 1$ and let $\epsilon \in (0, 0.1)$. There is a polynomial $P : \mathbb{R} \to \mathbb{R}$ of degree $g := \Theta\big(\max\big\{ \frac{\log(1/\epsilon)}{\log(\log(1/\epsilon)/B)}, B \big\}\big)$ such that for all $x \in [-B, B]$, we have $|P(x) - \exp(x)| < \epsilon \cdot \exp(x)$.

Proof. By Lemma 2.1, there is a polynomial $Q : \mathbb{R} \to \mathbb{R}$ of degree $g = \Theta\big(\max\big\{ \frac{\log(1/\epsilon)}{\log(\log(1/\epsilon)/B)}, B \big\}\big)$ such that, for all $y \in [0, 2B]$, we have $|Q(y) - \exp(y)| \le \epsilon$. Our desired polynomial is the rescaled $P(x) := Q(x + B)/\exp(B)$. Indeed, for any $x \in [-B, B]$, we have $\exp(x) \ge \exp(-B)$, and so
$$|P(x) - \exp(x)| = |Q(x + B)/\exp(B) - \exp(x)| = |Q(x + B) - \exp(x + B)|/\exp(B) \le \epsilon/\exp(B) \le \epsilon \cdot \exp(x),$$
as desired.
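The shift-and-rescale step in the proof of Corollary 2.2 is easy to check numerically. The sketch below uses a plain truncated Taylor series as a stand-in for the optimal-degree polynomial of Lemma 2.1 ([AA22]) — so the degree is illustrative, not the one the lemma promises — and verifies that an additive-error approximation of $\exp$ on $[0, 2B]$, divided by $\exp(B)$ and shifted, gives a relative-error approximation on $[-B, B]$.

```python
import math
import numpy as np

B, g = 2.0, 30                        # entry bound and an illustrative degree; the Taylor
                                      # series stands in for the polynomial of Lemma 2.1

def Q_poly(y):                        # additive-error approximation of exp on [0, 2B]
    return sum(y ** k / math.factorial(k) for k in range(g + 1))

def P_poly(x):                        # rescaled polynomial P(x) = Q(x + B) / exp(B)
    return Q_poly(x + B) / math.exp(B)

xs = np.linspace(-B, B, 1001)
rel_err = np.max(np.abs(P_poly(xs) - np.exp(xs)) / np.exp(xs))
print(rel_err)                        # tiny: additive error on [0, 2B] became relative error on [-B, B]
```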
3 Attention Algorithm

In this section, we show how to use polynomial approximations for the exponential function in order to approximately perform attention computations. In Section 3.1, we define the type of low-rank matrix approximation which we will use. In Section 3.2, we show how polynomial approximations can give rise to such low-rank matrix approximations. In Section 3.3, we bound the entries of the matrix $QK^\top \in \mathbb{R}^{n \times n}$ (before converting it to the attention matrix) to confirm that our polynomial approximation applies. In Section 3.4, we state our main technique for approximating the attention matrix. In Section 3.5, we show how to control the error propagation from $A$ to the rescaling matrix $D$. In Section 3.6, we further explain how to control the error propagation from $D$ and $A$ to the resulting attention matrix. Finally, in Section 3.7, we present our general algorithm, and in Section 3.8, we appropriately select the parameters to achieve almost linear time.

3.1 Matrix Low-Rank Approximation

Definition 3.1. Let $r \ge 1$ denote a positive integer, and let $\epsilon \in (0, 0.1)$ denote an accuracy parameter. Given a matrix $A \in \mathbb{R}^{n \times n}_{\ge 0}$, we say $\widetilde{A} \in \mathbb{R}^{n \times n}_{\ge 0}$ is an $(\epsilon, r)$-approximation of $A$ if $\widetilde{A} = U_1 U_2^\top$ for some matrices $U_1, U_2 \in \mathbb{R}^{n \times r}$ (i.e., $\widetilde{A}$ has rank at most $r$), and $|\widetilde{A}_{i,j} - A_{i,j}| \le \epsilon \cdot A_{i,j}$ for all $(i, j) \in [n]^2$.

3.2 From Low-Degree Polynomials to Low-Rank Matrices

Lemma 3.2. Let $M = XY^\top \in \mathbb{R}^{n \times n}$ denote a matrix with $X, Y \in \mathbb{R}^{n \times d}$. Let $P(x)$ denote a degree-$g$ polynomial, and define $r = \binom{2(g+d)}{2g}$. There is an algorithm that runs in $O(nrg)$ time and, given as input the matrices $X, Y$, constructs matrices $U_1, U_2 \in \mathbb{R}^{n \times r}$ such that $P(M) = U_1 U_2^\top$. (Here, $P(M)$ denotes the entry-wise application of $P$ to $M$.)

Due to space limitations, we defer the proof of Lemma 3.2 to Appendix A.

3.3 The Matrix $QK^\top$ Has Bounded Entries

Lemma 3.3 (Bounded entries). Suppose $B \ge 1$ and the matrices $Q, K \in \mathbb{R}^{n \times d}$ have $\|Q\|_\infty \le B$ and $\|K\|_\infty \le B$. Then, we have $\|QK^\top\|_\infty \le d \cdot B^2$.

Proof. For each $(i, j) \in [n] \times [n]$, we have $|(QK^\top)_{i,j}| = |\sum_{l=1}^d Q_{i,l} K_{j,l}| \le d \cdot \|Q\|_\infty \cdot \|K\|_\infty \le d \cdot B^2$, as desired.

3.4 Key Lemma

Our key lemma shows that, even though the attention matrix $A$ may have full rank, it has a low-rank approximation that is easy to compute:

Lemma 3.4. Suppose $Q, K \in \mathbb{R}^{n \times d}$, with $\|Q\|_\infty \le B$ and $\|K\|_\infty \le B$. Let $A := \exp(QK^\top/d) \in \mathbb{R}^{n \times n}$. For accuracy parameter $\epsilon \in (0, 1)$, there is a positive integer $g$ bounded above by $g = O\big(\max\big\{ \frac{\log(1/\epsilon)}{\log(\log(1/\epsilon)/B^2)}, B^2 \big\}\big)$, and a positive integer $r$ bounded above by $r \le \binom{2(g+d)}{2g}$, such that there is a matrix $\widetilde{A} \in \mathbb{R}^{n \times n}$ that is an $(\epsilon, r)$-approximation (Definition 3.1) of $A \in \mathbb{R}^{n \times n}$. Furthermore, the matrices $U_1$ and $U_2$ defining $\widetilde{A}$ can be computed in $O(n \cdot r)$ time.

Proof. Let $M := QK^\top/d$. From Lemma 3.3, we know that $\|M\|_\infty \le B^2$. Thus, applying Corollary 2.2 (with bound $B^2$ on its entries), there is a degree-$g$ polynomial $P$ such that the matrix $\widetilde{A} = P(M)$ is an $(\epsilon, r)$-approximation to $A$ (see the definition of $(\epsilon, r)$-approximation in Definition 3.1). We can then compute $U_1, U_2$ using Lemma 3.2, which gives the bound $r \le \binom{2(g+d)}{2g}$. This completes the proof.

3.5 From $A$ to $D$

Lemma 3.5. Let $A \in \mathbb{R}^{n \times n}$ be any matrix whose entries are all positive, and let $\epsilon_A \in (0, 0.1)$ be any parameter. Let $\widetilde{A} \in \mathbb{R}^{n \times n}$ be any matrix such that, for all $(i, j) \in [n] \times [n]$, we have $|\widetilde{A}_{i,j} - A_{i,j}| \le \epsilon_A \cdot A_{i,j}$. Define the matrices $D, \widetilde{D} \in \mathbb{R}^{n \times n}$ by $D := \mathrm{diag}(A \mathbf{1}_n)$ and $\widetilde{D} := \mathrm{diag}(\widetilde{A} \mathbf{1}_n)$. Then, for all $i \in [n]$, we have $|\widetilde{D}_{i,i} - D_{i,i}| \le \epsilon_A \cdot D_{i,i}$.

Due to space limitations, we defer the proof of Lemma 3.5 to Appendix A.

3.6 From $A$ and $D$ to the Attention Matrix

Lemma 3.6. Let $\epsilon_A, \epsilon_D \in (0, 0.1)$ and $B > 1$ be parameters, and let $V \in \mathbb{R}^{n \times d}$ denote a matrix with $\|V\|_\infty \le B$. Let $A \in \mathbb{R}^{n \times n}$ be any matrix whose entries are all positive, and let $\widetilde{A} \in \mathbb{R}^{n \times n}$ be a matrix such that, for all $(i, j) \in [n] \times [n]$, we have $|\widetilde{A}_{i,j} - A_{i,j}| \le \epsilon_A \cdot A_{i,j}$. Let $D, \widetilde{D} \in \mathbb{R}^{n \times n}$ be any diagonal matrices with positive entries on their diagonals, with the property that, for all $i \in [n]$, we have $|\widetilde{D}_{i,i} - D_{i,i}| \le \epsilon_D \cdot D_{i,i}$. Then, we have $\|\widetilde{D}^{-1} \widetilde{A} V - D^{-1} A V\|_\infty \le (\epsilon_A + \epsilon_D) \cdot B$.

Due to space limitations, we defer the proof of Lemma 3.6 to Appendix A.
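Lemmas 3.5 and 3.6 are simple perturbation bounds, and before assembling the algorithm in Section 3.7 they can be sanity-checked numerically. In the sketch below, $\widetilde{D} := \mathrm{diag}(\widetilde{A}\mathbf{1}_n)$ (as in how the lemmas are applied in Algorithm 1), and the sizes and perturbation model are illustrative choices of ours, not part of the formal statements.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, B = 50, 4, 1.0
eps_A = eps_D = 0.05

A = rng.uniform(0.5, 2.0, size=(n, n))                    # positive entries
V = rng.uniform(-B, B, size=(n, d))
A_t = A * (1 + eps_A * rng.uniform(-1, 1, size=(n, n)))   # |A~_{ij} - A_{ij}| <= eps_A * A_{ij}

D = np.diag(A.sum(axis=1))                                # D  = diag(A 1_n)
D_t = np.diag(A_t.sum(axis=1))                            # D~ = diag(A~ 1_n)

# Lemma 3.5: the row sums inherit the entrywise relative error
assert np.all(np.abs(np.diag(D_t) - np.diag(D)) <= eps_A * np.diag(D) + 1e-12)

# Lemma 3.6: the rescaled product is off by at most (eps_A + eps_D) * B entrywise
err = np.max(np.abs(np.linalg.inv(D_t) @ A_t @ V - np.linalg.inv(D) @ A @ V))
assert err <= (eps_A + eps_D) * B + 1e-12
print(err)
```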
3.7 Main Upper Bound

Theorem 3.7. For positive integers $n, d$ and real parameters $\epsilon > 0$ and $B > 1$, there are positive integers $g = \Theta\big(\max\big\{ \frac{\log(1/\epsilon)}{\log(\log(1/\epsilon)/B^2)}, B^2 \big\}\big)$ and $r = \binom{2(g+d)}{2g}$ such that there is an algorithm (Algorithm 1) that runs in $O(\mathcal{T}_{\mathrm{mat}}(n, r, d) + nrg)$ time and solves $\mathsf{AAttC}(n, d, B, \epsilon)$ (Definition 1.2).

Proof. The running time of each step is shown in Algorithm 1; the overall running time follows from Lemma 3.4. Correctness follows from Lemma 3.5 and Lemma 3.6.

Algorithm 1 Our Algorithm
1: procedure POLYATTENTION($Q \in \mathbb{R}^{n \times d}$, $K \in \mathbb{R}^{n \times d}$, $V \in \mathbb{R}^{n \times d}$, $n$, $d$, $B$, $\epsilon$)   ▷ Theorem 1.4
2:   ▷ $\epsilon$ is the accuracy of the output
3:   ▷ $\|Q\|_\infty, \|K\|_\infty, \|V\|_\infty \le B$
4:   $g \leftarrow O(\max\{\log(1/\epsilon)/\log(\log(1/\epsilon)/B^2),\, B^2\})$
5:   $r \leftarrow \binom{2(g+d)}{2g}$
6:   Construct $U_1, U_2 \in \mathbb{R}^{n \times r}$ via Lemma 3.4   ▷ $O(nrg)$ time
7:   $\widetilde{w} \leftarrow U_1 \cdot (U_2^\top \mathbf{1}_n)$   ▷ $O(nr)$ time
8:   $\widetilde{D}^{-1} \leftarrow \mathrm{diag}(\widetilde{w})^{-1}$   ▷ $O(n)$ time
9:   Compute $U_2^\top V \in \mathbb{R}^{r \times d}$   ▷ $\mathcal{T}_{\mathrm{mat}}(r, n, d)$ time
10:  Compute $U_1 \cdot (U_2^\top V)$   ▷ $\mathcal{T}_{\mathrm{mat}}(n, r, d)$ time
11:  $T \leftarrow \widetilde{D}^{-1} \cdot (U_1 \cdot (U_2^\top V))$   ▷ $O(nd)$ time
12:  return $T$   ▷ $T \in \mathbb{R}^{n \times d}$
13: end procedure

3.8 Proof of Theorem 1.4

Theorem 3.8 (Upper bound, formal statement of Theorem 1.4). $\mathsf{AAttC}(n, d = O(\log n), B = o(\sqrt{\log n}), \epsilon_a = 1/\mathrm{poly}(n))$ can be solved in time $\mathcal{T}_{\mathrm{mat}}(n, n^{o(1)}, d) = n^{1+o(1)}$.

Proof. If we select the parameters $B = o(\sqrt{\log n})$, $\epsilon = 1/\mathrm{poly}(n)$, and $d = O(\log n)$ in Theorem 3.7, then we see that
$$g = O\Big(\max\Big\{ \frac{\log(1/\epsilon)}{\log(\log(1/\epsilon)/B^2)}, B^2 \Big\}\Big) = O\Big(\max\Big\{ \frac{\log n}{\log(\log(n)/B^2)}, B^2 \Big\}\Big) = o(\log n),$$
where the second step follows from $\epsilon = 1/\mathrm{poly}(n)$, and the last step follows from $B = o(\sqrt{\log n})$, which gives $B^2 = o(\log n)$ and $\log(\log(n)/B^2) = \omega(1)$.

Since $g = o(\log n)$, let us write $g = (\log n)/f$ for some $f = \omega(1)$. We thus have that
$$r = \binom{2(d + g)}{2g} \le \Big(\frac{e(d+g)}{g}\Big)^{2g} \le 2^{O(g \log((\log n)/g))} \le 2^{O(\log n \cdot \log(f)/f)} < 2^{o(\log n)} < n^{o(1)}.$$
The second step follows from the generic bound $\binom{a}{b} \le (ea/b)^b$ for $1 \le b \le a$, and the third step uses that $d = O(\log n)$. Since $d$, $r$, and $g$ are all bounded by $n^{o(1)}$, our final running time is $n^{1+o(1)}$, as desired.
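To make Algorithm 1 concrete, here is a small, self-contained sketch of the pipeline. For step 6 it uses a truncated Taylor polynomial of $\exp$ together with a naive tensor-power feature map of rank $1 + d + \cdots + d^g$, rather than the optimal-degree polynomial of Lemma 2.1 and the tighter $\binom{2(g+d)}{2g}$ indexing of Lemma 3.2, so the function names, the default degree, and the constants are purely illustrative. The point is that the $n \times n$ matrix $A$ is never formed.

```python
import math
import numpy as np

def kron_power_rows(X, k):
    """Row-wise k-fold tensor (Kronecker) power of X in R^{n x d}: returns an
    n x d^k matrix whose i-th row is X_i tensored with itself k times."""
    out = np.ones((X.shape[0], 1))
    for _ in range(k):
        out = (out[:, :, None] * X[:, None, :]).reshape(X.shape[0], -1)
    return out

def low_rank_factors(X, Y, coeffs):
    """U1, U2 with U1 @ U2.T == P(X @ Y.T) entry-wise, where P(z) = sum_k coeffs[k] z^k.
    Rank sum_k d^k here; Lemma 3.2 achieves binom(2(g+d), 2g) with a multiset indexing."""
    U1 = np.hstack([c * kron_power_rows(X, k) for k, c in enumerate(coeffs)])
    U2 = np.hstack([kron_power_rows(Y, k) for k in range(len(coeffs))])
    return U1, U2

def poly_attention(Q, K, V, g=6):
    """Sketch of Algorithm 1 (POLYATTENTION); never materializes the n x n matrix A."""
    n, d = Q.shape
    coeffs = [1 / math.factorial(k) for k in range(g + 1)]   # stand-in for Lemma 2.1's polynomial
    U1, U2 = low_rank_factors(Q / d, K, coeffs)              # U1 U2^T approximates exp(QK^T/d)
    w = U1 @ (U2.T @ np.ones(n))                             # step 7: w~ = U1 (U2^T 1_n)
    return (U1 @ (U2.T @ V)) / w[:, None]                    # steps 9-11: D~^{-1} U1 (U2^T V)

# compare against the straightforward quadratic-time computation
rng = np.random.default_rng(0)
n, d, B = 200, 4, 0.5
Q, K, V = (rng.uniform(-B, B, size=(n, d)) for _ in range(3))
A = np.exp(Q @ K.T / d)
exact = (A @ V) / A.sum(axis=1, keepdims=True)
print(np.max(np.abs(poly_attention(Q, K, V) - exact)))       # small l_infinity error
```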
4 Lower Bound

In this section, we prove our fine-grained lower bound for attention computation. In Section 4.1, we state the Strong Exponential Time Hypothesis (SETH), the main hardness assumption we will use. In Section 4.2, we define the approximate nearest neighbor search problem and state its known hardness assuming SETH. Finally, in Section 4.3, we give a reduction from approximate nearest neighbor search to attention computation, which implies our hardness result.

4.1 Fine-Grained Hypotheses

The Strong Exponential Time Hypothesis (SETH) was introduced by Impagliazzo and Paturi [IP01] over 20 years ago. It is a strengthening of the P ≠ NP conjecture, and asserts that our current best SAT algorithms are roughly optimal:

Hypothesis 4.1 (Strong Exponential Time Hypothesis (SETH)). For every $\epsilon > 0$ there is a positive integer $k \ge 3$ such that $k$-SAT on formulas with $n$ variables cannot be solved in $O(2^{(1-\epsilon)n})$ time, even by a randomized algorithm.

SETH is a popular conjecture which has been used to prove fine-grained lower bounds for a wide variety of algorithmic problems. See, for instance, the survey [Wil18].

4.2 Nearest Neighbor Search

We will make use of a known relationship between SETH and approximate nearest neighbor search.

Definition 4.2 (Approximate Hamming Nearest Neighbor Search (ANN)). For a parameter $\epsilon > 0$, in the $(1 + \epsilon)$-Approximate Hamming Nearest Neighbor Search problem for $n$ vectors of dimension $d$, we are given as input two sets $A, B \subseteq \{0, 1\}^d$ with $|A| = |B| = n$, and our goal is to find an $a^* \in A$ and $b^* \in B$ satisfying $\|a^* - b^*\|_0 \le (1 + \epsilon) \cdot \min_{a \in A, b \in B} \|a - b\|_0$. (This is sometimes called the bichromatic ANN problem, and a monochromatic version has also been studied; see, for instance, [SM19].)

Rubinstein [Rub18] showed that for certain parameters, it is impossible to substantially improve on the straightforward quadratic-time algorithm for ANN assuming SETH:

Lemma 4.3 ([Rub18]). Assuming SETH, for every $q > 0$, there are $\epsilon \in (0, 1)$ and $C > 0$ such that $(1 + \epsilon)$-Approximate Hamming Nearest Neighbor Search in dimension $d = C \log n$ requires $\Omega(n^{2-q})$ time.

Remark 4.4. We may assume that Lemma 4.3 holds even in the special case where each input vector from $A$ and $B$ has half its entries equal to 0 and half equal to 1. Indeed, for any vector $a \in \{0, 1\}^d$, we can construct a new vector $\widetilde{a} \in \{0, 1\}^{2d}$ given by concatenating $a$ with its binary complement, $\widetilde{a} := (a, \bar{a})$, where $\bar{a} \in \{0, 1\}^d$ is defined by $\bar{a}_i = 1 - a_i$ for all $i \in [d]$. Thus, $\|\widetilde{a}\|_0 = d$. We can similarly construct a new vector $\widetilde{b} \in \{0, 1\}^{2d}$ for each $b \in B$. After this transformation, for any $a \in A$ and $b \in B$, we have $\|\widetilde{a} - \widetilde{b}\|_0 = 2\|a - b\|_0$, so it suffices to find an approximate nearest neighbor among these transformed vectors.

For convenience in our analysis, we define a gap version of the approximate nearest neighbor search problem, $\mathsf{GapANN}(n, d, t, \epsilon)$.

Definition 4.5 (Gap approximate nearest neighbor search ($\mathsf{GapANN}(n, d, t, \epsilon)$)). Let $n, d$ denote two positive integers, let $t > 0$ denote a threshold parameter, and let $\epsilon$ denote an accuracy parameter. Given two sets of points $A = \{a_1, \ldots, a_n\} \subseteq \{0, 1\}^d$ and $B = \{b_1, \ldots, b_n\} \subseteq \{0, 1\}^d$, for each $i \in [n]$ we need to distinguish the following two cases:

- Case 1. There exists a $j \in [n]$ such that $\|a_i - b_j\|_0 < t$.
- Case 2. For all $j \in [n]$ we have $\|a_i - b_j\|_2^2 \ge (1 + \epsilon) \cdot t$.

(Recall that for binary vectors, $\|a_i - b_j\|_2^2 = \|a_i - b_j\|_0$.)

An algorithm for $\mathsf{GapANN}(n, d, t, \epsilon)$ can be called $\log(nd)$ times to binary search for the answer to ANN, so Lemma 4.3 holds as well for $\mathsf{GapANN}(n, d, t, \epsilon)$.

4.3 Hardness Result

In the remainder of this section, we prove our lower bound for attention computation:

Theorem 4.6 (Main Result, formal version of Theorem 1.3). Assuming SETH, for every sufficiently small $q > 0$, there are constants $C > 0$, $C_\alpha > 0$, and $C_\beta > 1$ such that Approximate Attention Computation $\mathsf{AAttC}$ (Definition 1.2) for parameters $(n, d = C \log n, B = C_\beta \sqrt{\log n}, \epsilon_a = n^{-C_\alpha})$ requires $\Omega(n^{2-q})$ time.

Proof. This follows from combining Lemma 4.3 (hardness of approximate nearest neighbor search) and Lemma 4.7 (a reduction from approximate nearest neighbor search to approximate attention computation), which we prove below.

Lemma 4.7. For any constant $C_\gamma \in (0, 0.1)$: for every $\epsilon > 0$ and $C > 0$, there exist constants $C_a > 0$ and $C_b > 0$ such that, if $\mathsf{AAttC}$ (Definition 1.2) for parameters $(2n, d = 2C \log n, B = C_b \sqrt{\log n}, \epsilon_a = n^{-C_a})$ can be solved in time $T$, then $\mathsf{GapANN}(n, d = C \log n, t, \epsilon)$ (Definition 4.5) can be solved in time $O(T + n^{2 - C_\gamma})$.

Due to space limitations, we defer the proof of Lemma 4.7 to Appendix B.

5 Conclusion

In this work, we showed that how quickly one can perform attention computation depends critically on $B$, the magnitude of the entries of the input matrices. Our main idea was to make use of similarities between attention computation and KDE, and to show that many known techniques for KDE can also be used in this setting. Since KDE is a very well-studied problem, it would be exciting to see what other techniques can be applied to attention computation as well. One limitation of our lower bound is that it is conditional: it relies on the well-known SETH conjecture from complexity theory. It would be interesting future work to show an unconditional lower bound. As far as we are aware, our work does not have negative societal impacts.

Acknowledgements

Josh Alman was partly supported by a grant from the Simons Foundation (Grant Number 825870 JA).
The authors would like to thank Beidi Chen for helpful discussions related to LLMs, and Feyza Duman, Chinmay Hegde, and Piotr Indyk for helpful comments on an earlier draft. The authors appreciate the constructive feedback from the NeurIPS 2023 reviewers. The authors would also like to thank Lichen Zhang and Ruizhe Zhang for useful and helpful suggestions and for proofreading the paper.

References

[AA22] Amol Aggarwal and Josh Alman. Optimal-degree polynomial approximations for exponentials and gaussian kernel density estimation. In 37th Computational Complexity Conference (CCC 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022.

[ACM+20] Josh Alman, Timothy Chu, Gary Miller, Shyam Narayanan, Mark Sellke, and Zhao Song. Metric transforms and low rank matrices via representation theory of the real hyperrectangle. arXiv preprint arXiv:2011.11503, 2020.

[ACSS20] Josh Alman, Timothy Chu, Aaron Schild, and Zhao Song. Algorithms and hardness for linear algebra on geometric graphs. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 541-552. IEEE, 2020.

[AS23] Josh Alman and Zhao Song. How to capture higher-order correlations? Generalizing matrix softmax attention to Kronecker computation. arXiv preprint arXiv:2310.04064, 2023.

[BCIS18] Arturs Backurs, Moses Charikar, Piotr Indyk, and Paris Siminelakis. Efficient density evaluation for smooth kernels. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 615-626. IEEE, 2018.

[BIS17] Arturs Backurs, Piotr Indyk, and Ludwig Schmidt. On the fine-grained complexity of empirical risk minimization: Kernel methods and neural networks. Advances in Neural Information Processing Systems, 30, 2017.

[BMR+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877-1901, 2020.

[CDL+22] Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, and Christopher Ré. Pixelated butterfly: Simple and efficient sparse training for neural network models. In International Conference on Learning Representations (ICLR), 2022.

[CDW+21] Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. Advances in Neural Information Processing Systems (NeurIPS), 34:17413-17426, 2021.

[CGRS19] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

[CKNS20] Moses Charikar, Michael Kapralov, Navid Nouri, and Paris Siminelakis. Kernel density estimation through density constrained near neighbor search. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 172-183. IEEE, 2020.

[CND+22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

[CS17] Moses Charikar and Paris Siminelakis. Hashing-based-estimators for kernel density in high dimensions. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 1032-1043. IEEE, 2017.

[CS19] Moses Charikar and Paris Siminelakis. Multi-resolution hashing for fast pairwise summations.
In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 769-792. IEEE, 2019.

[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[DKOD20] Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G. Dimakis. SMYRF: Efficient attention using asymmetric clustering. Advances in Neural Information Processing Systems (NeurIPS), 33:6476-6489, 2020.

[DLS23] Yichuan Deng, Zhihang Li, and Zhao Song. Attention scheme inspired softmax regression. arXiv preprint arXiv:2304.10411, 2023.

[DSZ23] Yichuan Deng, Zhao Song, and Tianyi Zhou. Superiority of softmax: Unveiling the performance edge over linear attention. arXiv preprint arXiv:2310.11685, 2023.

[GSWY23] Yeqi Gao, Zhao Song, Weixin Wang, and Junze Yin. A fast optimization view: Reformulating single layer attention in LLM based on tensor and SVM trick, and solving it in matrix multiplication time. arXiv preprint arXiv:2309.07418, 2023.

[GSY23] Yeqi Gao, Zhao Song, and Xin Yang. Differentially private attention computation. arXiv preprint arXiv:2305.04701, 2023.

[GSYZ23] Yeqi Gao, Zhao Song, Xin Yang, and Ruizhe Zhang. Fast quantum algorithm for attention computation. arXiv preprint arXiv:2307.08045, 2023.

[HJK+23] Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, and Amir Zandieh. HyperAttention: Long-context attention in near-linear time. arXiv preprint arXiv:2310.05869, 2023.

[IP01] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367-375, 2001.

[KKL20] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

[KMZ23] Praneeth Kacham, Vahab Mirrokni, and Peilin Zhong. PolySketchFormer: Fast transformers via sketches for polynomial kernels. arXiv preprint arXiv:2310.01655, 2023.

[KVPF20] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning (ICML), pages 5156-5165. PMLR, 2020.

[KWH23] Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, and Chinmay Hegde. On the computational complexity of self-attention. In International Conference on Algorithmic Learning Theory, pages 597-619. PMLR, 2023.

[LWD+23] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré, et al. Deja Vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning, pages 22137-22176. PMLR, 2023.

[Ope22] OpenAI. ChatGPT: Optimizing language models for dialogue. OpenAI Blog, https://openai.com/blog/chatgpt/, Nov 2022.

[RNS+18] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.

[Rub18] Aviad Rubinstein. Hardness of approximate nearest neighbor search. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1260-1268, 2018.

[RWC+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

[SM19] Karthik C. S. and Pasin Manurangsi. On closest pair in Euclidean metric: Monochromatic is as hard as bichromatic.
In 10th Innovations in Theoretical Computer Science (ITCS), 2019.

[TBY+19] Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Transformer dissection: An unified understanding of transformer's attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4344-4353, 2019.

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.

[Wil18] Virginia Vassilevska Williams. On some fine-grained questions in algorithms and complexity. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pages 3447-3487. World Scientific, 2018.

[WLK+20] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

[ZBIW19] Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8BERT: Quantized 8bit BERT. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS), pages 36-39. IEEE, 2019.

[ZHDK23] Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. KDEformer: Accelerating transformers via kernel density estimation. In ICML. arXiv preprint arXiv:2302.02451, 2023.

[ZRG+22] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

[ZSZ+23] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In NeurIPS. arXiv preprint arXiv:2306.14048, 2023.

We provide the missing proofs of our upper bound (algorithm) result in Section A, and the missing proofs of our lower bound (hardness) result in Section B.

A Missing Proofs for the Upper Bound

A.1 Proof of Lemma 3.2

Lemma A.1 (Restatement of Lemma 3.2). Let $M = XY^\top \in \mathbb{R}^{n \times n}$ denote a matrix with $X, Y \in \mathbb{R}^{n \times d}$. Let $P(x)$ denote a degree-$g$ polynomial, and define $r = \binom{2(g+d)}{2g}$. There is an algorithm that runs in $O(nrg)$ time and, given as input the matrices $X, Y$, constructs matrices $U_1, U_2 \in \mathbb{R}^{n \times r}$ such that $P(M) = U_1 U_2^\top$. (Here, $P(M)$ denotes the entry-wise application of $P$ to $M$.)

Proof. Let $P(x)$ denote the degree-$g$ polynomial, expanded in terms of its coefficients as $P(x) = \sum_{i=0}^{g} c_i \cdot x^i$.

Consider the function $\mathsf{K} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ defined by, for $u, v \in \mathbb{R}^d$, $\mathsf{K}(u, v) := P(\langle u, v \rangle)$. $\mathsf{K}$ is a degree-$2g$ polynomial in the $2d$ entries $u_1, \ldots, u_d, v_1, \ldots, v_d$ of the vectors $u, v$. Define the set $V$ of its variables, $V := \{u_1, \ldots, u_d, v_1, \ldots, v_d\}$, and let $\mathcal{F}$ denote the set of functions
$$\mathcal{F} := \Big\{ f : V \to \{0, 1, 2, \ldots, 2g\} \;\Big|\; \sum_{w \in V} f(w) \le 2g \Big\}.$$
We can count that $|\mathcal{F}| = \binom{2d + 2g}{2g}$. Hence, there are coefficients $c_t \in \mathbb{R}$ for each $t \in \mathcal{F}$ such that
$$\mathsf{K}(u, v) = \sum_{t \in \mathcal{F}} c_t \cdot \prod_{w \in V} w^{t(w)}.$$
Define $V_u := \{u_1, \ldots, u_d\}$ and $V_v := V \setminus V_u$. We define $\phi_u : \mathbb{R}^d \to \mathbb{R}^{|\mathcal{F}|}$ by, for any $t \in \mathcal{F}$,
$$\phi_u(u_1, \ldots, u_d)_t = c_t \cdot \prod_{u_i \in V_u} u_i^{t(u_i)}.$$
Similarly, we define $\phi_v : \mathbb{R}^d \to \mathbb{R}^{|\mathcal{F}|}$ by, for any $t \in \mathcal{F}$,
$$\phi_v(v_1, \ldots, v_d)_t = \prod_{v_i \in V_v} v_i^{t(v_i)}.$$
Thus, we have $\mathsf{K}(u, v) = \langle \phi_u(u), \phi_v(v) \rangle$. For $i \in [n]$, let $X_i \in \mathbb{R}^d$ denote the $i$-th row of $X$, and let $Y_i \in \mathbb{R}^d$ denote the $i$-th row of $Y$. Our algorithm can thus construct the matrix $U_1 \in \mathbb{R}^{n \times |\mathcal{F}|}$ whose $i$-th row is the vector $\phi_u(X_i)$ for $i \in [n]$, and the matrix $U_2 \in \mathbb{R}^{n \times |\mathcal{F}|}$ whose $i$-th row is the vector $\phi_v(Y_i)$ for $i \in [n]$. Each entry of these matrices can be constructed by multiplying together at most $g$ variables, so these $n \times r$ matrices can be constructed in time $O(nrg)$, as desired.
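The feature maps $\phi_u, \phi_v$ above are straightforward to implement. The sketch below is a direct (unoptimized) transcription for the relevant case $\mathsf{K}(u,v) = P(\langle u, v \rangle) = \sum_k c_k \langle u, v \rangle^k$: since $\langle u, v \rangle^k = \sum_{|\alpha| = k} \binom{k}{\alpha} u^\alpha v^\alpha$, only monomials whose $u$- and $v$-exponents match occur, so features can be indexed by a single multi-index $\alpha$ with $|\alpha| \le g$. The function name and the coefficient choice in the test are ours.

```python
import itertools
import math
import numpy as np

def phi_factors(X, Y, coeffs):
    """Rows of U1, U2 are phi_u(X_i), phi_v(Y_i), so that U1 @ U2.T equals
    P(X @ Y.T) entry-wise, where P(z) = sum_k coeffs[k] * z^k."""
    n, d = X.shape
    g = len(coeffs) - 1
    cols1, cols2 = [], []
    for k in range(g + 1):
        for combo in itertools.combinations_with_replacement(range(d), k):
            alpha = np.zeros(d, dtype=int)          # exponent vector of this monomial
            for i in combo:
                alpha[i] += 1
            multinom = math.factorial(k)
            for a in alpha:
                multinom //= math.factorial(int(a))
            cols1.append(coeffs[k] * multinom * np.prod(X ** alpha, axis=1))  # phi_u part
            cols2.append(np.prod(Y ** alpha, axis=1))                         # phi_v part
    return np.stack(cols1, axis=1), np.stack(cols2, axis=1)

# quick check on a tiny instance
rng = np.random.default_rng(0)
n, d, g = 7, 3, 4
X, Y = rng.normal(size=(n, d)), rng.normal(size=(n, d))
coeffs = [1 / math.factorial(k) for k in range(g + 1)]        # e.g. a truncated Taylor series of exp
U1, U2 = phi_factors(X, Y, coeffs)
P_M = sum(c * (X @ Y.T) ** k for k, c in enumerate(coeffs))
print(np.max(np.abs(U1 @ U2.T - P_M)))                        # ~ machine precision
print(U1.shape[1], math.comb(2 * (g + d), 2 * g))             # actual rank vs. the lemma's bound
```

The number of columns produced here is $\binom{g+d}{d}$, comfortably below the bound $r = \binom{2(g+d)}{2g}$ used in the lemma.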
A.2 Proof of Lemma 3.5

Lemma A.2 (Restatement of Lemma 3.5). Let $A \in \mathbb{R}^{n \times n}$ be any matrix whose entries are all positive, and let $\epsilon_A \in (0, 0.1)$ be any parameter. Let $\widetilde{A} \in \mathbb{R}^{n \times n}$ be any matrix such that, for all $(i, j) \in [n] \times [n]$, we have $|\widetilde{A}_{i,j} - A_{i,j}| \le \epsilon_A \cdot A_{i,j}$. Define the matrices $D, \widetilde{D} \in \mathbb{R}^{n \times n}$ by $D := \mathrm{diag}(A \mathbf{1}_n)$ and $\widetilde{D} := \mathrm{diag}(\widetilde{A} \mathbf{1}_n)$. Then, for all $i \in [n]$, we have $|\widetilde{D}_{i,i} - D_{i,i}| \le \epsilon_A \cdot D_{i,i}$.

Proof. We calculate that
$$|\widetilde{D}_{i,i} - D_{i,i}| = \Big| \sum_{j=1}^n \widetilde{A}_{i,j} - \sum_{j=1}^n A_{i,j} \Big| \le \sum_{j=1}^n |\widetilde{A}_{i,j} - A_{i,j}| \le \epsilon_A \sum_{j=1}^n A_{i,j} = \epsilon_A \cdot D_{i,i},$$
where the second step follows from the triangle inequality. This completes the proof.

A.3 Proof of Lemma 3.6

Lemma A.3 (Restatement of Lemma 3.6). Let $\epsilon_A, \epsilon_D \in (0, 0.1)$ and $B > 1$ be parameters, and let $V \in \mathbb{R}^{n \times d}$ denote a matrix with $\|V\|_\infty \le B$. Let $A \in \mathbb{R}^{n \times n}$ be any matrix whose entries are all positive, and let $\widetilde{A} \in \mathbb{R}^{n \times n}$ be a matrix such that, for all $(i, j) \in [n] \times [n]$, we have $|\widetilde{A}_{i,j} - A_{i,j}| \le \epsilon_A \cdot A_{i,j}$. Let $D, \widetilde{D} \in \mathbb{R}^{n \times n}$ be any diagonal matrices with positive entries on their diagonals, with the property that, for all $i \in [n]$, we have $|\widetilde{D}_{i,i} - D_{i,i}| \le \epsilon_D \cdot D_{i,i}$. Then, we have $\|\widetilde{D}^{-1} \widetilde{A} V - D^{-1} A V\|_\infty \le (\epsilon_A + \epsilon_D) \cdot B$.

Proof. We have
$$\|\widetilde{D}^{-1} \widetilde{A} V - D^{-1} A V\|_\infty \le \|\widetilde{D}^{-1} \widetilde{A} V - D^{-1} \widetilde{A} V\|_\infty + \|D^{-1} \widetilde{A} V - D^{-1} A V\|_\infty. \quad (1)$$
We now bound each of these two terms separately. First, for each $(i, j) \in [n] \times [d]$,
$$|(\widetilde{D}^{-1} \widetilde{A} V - D^{-1} \widetilde{A} V)_{i,j}| = \Big| \sum_{l=1}^n (\widetilde{D}_{i,i}^{-1} - D_{i,i}^{-1}) \widetilde{A}_{i,l} V_{l,j} \Big| \le \sum_{l=1}^n |(\widetilde{D}_{i,i}^{-1} - D_{i,i}^{-1}) \widetilde{A}_{i,l}| \cdot \|V\|_\infty = \sum_{l=1}^n \Big| \frac{D_{i,i} - \widetilde{D}_{i,i}}{D_{i,i} \widetilde{D}_{i,i}} \widetilde{A}_{i,l} \Big| \cdot \|V\|_\infty \le \epsilon_D \sum_{l=1}^n |\widetilde{D}_{i,i}^{-1} \widetilde{A}_{i,l}| \cdot \|V\|_\infty = \epsilon_D \sum_{l=1}^n \widetilde{D}_{i,i}^{-1} \widetilde{A}_{i,l} \cdot \|V\|_\infty = \epsilon_D \cdot \|V\|_\infty \le \epsilon_D \cdot B, \quad (2)$$
where the second step follows from the triangle inequality, the fourth step follows from $|(D_{i,i} - \widetilde{D}_{i,i})/D_{i,i}| \le \epsilon_D$, the fifth step follows from $\widetilde{D}_{i,i}^{-1} > 0$ and $\widetilde{A}_{i,l} > 0$, and the last step follows from our assumption on $V$.

Second, for each $(i, j) \in [n] \times [d]$,
$$|(D^{-1} \widetilde{A} V - D^{-1} A V)_{i,j}| = \Big| \sum_{l=1}^n D_{i,i}^{-1} (\widetilde{A}_{i,l} - A_{i,l}) V_{l,j} \Big| \le \sum_{l=1}^n |D_{i,i}^{-1}| \cdot |\widetilde{A}_{i,l} - A_{i,l}| \cdot \|V\|_\infty = \sum_{l=1}^n D_{i,i}^{-1} |\widetilde{A}_{i,l} - A_{i,l}| \cdot \|V\|_\infty \le \sum_{l=1}^n D_{i,i}^{-1} \epsilon_A A_{i,l} \cdot B = \epsilon_A \cdot B, \quad (3)$$
where the second step follows from the triangle inequality, the third step follows from $D_{i,i}^{-1} > 0$, the fourth step follows from $|\widetilde{A}_{i,l} - A_{i,l}| \le \epsilon_A \cdot A_{i,l}$, and the last step follows from the definition of $D_{i,i}$.

The result follows by combining Eq. (1) with the two inequalities Eq. (2) and Eq. (3).

B Missing Proofs for the Lower Bound

Lemma B.1 (Restatement of Lemma 4.7). For any constant $C_\gamma \in (0, 0.1)$: for every $\epsilon > 0$ and $C > 0$, there exist constants $C_a > 0$ and $C_b > 0$ such that, if $\mathsf{AAttC}$ (Definition 1.2) for parameters $(2n, d = 2C \log n, B = C_b \sqrt{\log n}, \epsilon_a = n^{-C_a})$ can be solved in time $T$, then $\mathsf{GapANN}(n, d = C \log n, t, \epsilon)$ (Definition 4.5) can be solved in time $O(T + n^{2 - C_\gamma})$.

Proof. We give an algorithm with the stated running time for $\mathsf{GapANN}(n, d = C \log n, t, \epsilon)$. Let $c > 0$ be a parameter we will choose later (it will be a function of $C$ and $C_\gamma$). Our algorithm proceeds in one of two ways depending on the value of $t$. If $t < c \log n$, then we will use one algorithm which runs in time $O(n^{2 - C_\gamma})$. Otherwise, if $t \ge c \log n$, we will use another algorithm which runs in time $O(T)$.

Case 1: $t < c \log n$. Let $a_1, \ldots, a_n, b_1, \ldots, b_n \in \{0, 1\}^d$ be the input vectors to $\mathsf{GapANN}$, and let $t \in [0, d]$ denote the target distance. Recall that $d = C \log n$.
In this $t < c \log n$ case, we will simply brute-force for the answer in the following way: we first store the vectors $b_1, \ldots, b_n$ in a lookup table, then, for each $i \in [n]$, we iterate over all vectors $b' \in \{0, 1\}^d$ which have Hamming distance at most $t$ from $a_i$ and check whether $b'$ is in the lookup table. This determines whether there is a $b \in B$ at distance at most $t$ from $a_i$, as desired. For each $i \in [n]$, we need to iterate over at most $\sum_{s=0}^{t} \binom{d}{s}$ choices for the vector $b'$, so the total running time will be $O(n \cdot \sum_{s=0}^{t} \binom{d}{s})$. By standard bounds on binomial coefficients, we know that
$$n \cdot \sum_{s=0}^{c \log n} \binom{C \log n}{s} \le n^{1 + f(C, c)}$$
for some function $f : \mathbb{R}_{>0} \times \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ with the property that, for any fixed $C > 0$, we have $\lim_{c \to 0} f(C, c) = 0$. We can thus pick a sufficiently small constant $c > 0$, depending only on $C_\gamma$ and $C$, such that $f(C, c) < 1 - C_\gamma$, and this entire brute-force takes $O(n^{2 - C_\gamma})$ time.

Case 2: $t \ge c \log n$. Let $a_1, \ldots, a_n, b_1, \ldots, b_n \in \{0, 1\}^d$ denote the input of $\mathsf{GapANN}(n, d, t, \epsilon)$ (Definition 4.5), and recall from Remark 4.4 that we may assume each has half its entries 0 and half its entries 1. We will explain how to construct an attention instance from this input.

Let $C_0 \ge c$ be such that
$$t := C_0 \log n. \quad (4)$$
Let $\beta > 0$ and $\widetilde{d} \ge d$ denote parameters we will choose later (see Eq. (9) and Eq. (6), respectively). Define $\tau > 0$ by
$$\tau := \exp(\beta/2). \quad (5)$$
Intuitively, our goal in picking these parameters is that $\tau$ will be an upper bound on the entries of the attention matrix, i.e., we will have
$$\tau \ge \max_{i \in [n], j \in [n]} \exp(\beta \langle a_i, b_j \rangle / \widetilde{d}).$$
We will make use of an algorithm for the $\mathsf{AAttC}(\widetilde{n}, \widetilde{d}, B, \epsilon_a)$ problem, for the following parameters:
$$\widetilde{n} := 2n, \qquad \widetilde{d} := 2d, \quad (6)$$
$$B := C_b \sqrt{\log n}, \quad \text{where } C_b := \sqrt{40 C / (C_0 \epsilon)}, \quad (7)$$
$$\epsilon_a := n^{-C_a}, \quad \text{where } C_a := 2 + C_b^2 (1 + C_0/C). \quad (8)$$
Furthermore, set
$$\beta := B^2. \quad (9)$$
We define $Q \in \mathbb{R}^{\widetilde{n} \times \widetilde{d}}$ and $K \in \mathbb{R}^{\widetilde{n} \times \widetilde{d}}$ by
$$Q := \sqrt{\beta} \cdot \begin{bmatrix} a_1^\top & \mathbf{1}_d^\top \\ \vdots & \vdots \\ a_n^\top & \mathbf{1}_d^\top \\ \mathbf{0}_d^\top & \mathbf{1}_d^\top \\ \vdots & \vdots \\ \mathbf{0}_d^\top & \mathbf{1}_d^\top \end{bmatrix}, \qquad K := \sqrt{\beta} \cdot \begin{bmatrix} b_1^\top & \mathbf{0}_d^\top \\ \vdots & \vdots \\ b_n^\top & \mathbf{0}_d^\top \\ \mathbf{0}_d^\top & \mathbf{1}_d^\top \\ \vdots & \vdots \\ \mathbf{0}_d^\top & \mathbf{1}_d^\top \end{bmatrix}.$$
Since each entry of $Q$ and $K$ is either $\sqrt{\beta}$ or $0$, it follows that
$$\|QK^\top / \widetilde{d}\|_\infty \le \beta \cdot \widetilde{d}/\widetilde{d} = \beta = B^2.$$
Using the above construction of $Q \in \mathbb{R}^{\widetilde{n} \times \widetilde{d}}$ and $K \in \mathbb{R}^{\widetilde{n} \times \widetilde{d}}$, we note that $A := \exp(QK^\top/\widetilde{d}) \in \mathbb{R}^{\widetilde{n} \times \widetilde{n}}$ is given by
$$A = \begin{bmatrix} \exp(\beta \langle a_1, b_1 \rangle / \widetilde{d}) & \cdots & \exp(\beta \langle a_1, b_n \rangle / \widetilde{d}) & \tau & \cdots & \tau \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \exp(\beta \langle a_n, b_1 \rangle / \widetilde{d}) & \cdots & \exp(\beta \langle a_n, b_n \rangle / \widetilde{d}) & \tau & \cdots & \tau \\ \exp(0) & \cdots & \exp(0) & \tau & \cdots & \tau \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \exp(0) & \cdots & \exp(0) & \tau & \cdots & \tau \end{bmatrix}.$$
(Note that we do not explicitly compute all the entries of $A$ in our algorithm; we will make use of it only through calling our algorithm for the Attention problem.)

For each $(i, j) \in [n] \times [n]$, we know that
$$A_{i,j} = \exp(\beta \langle a_i, b_j \rangle / \widetilde{d}) \le \exp(\beta d / \widetilde{d}) = \exp(\beta/2) = \tau, \quad (10)$$
where the first step follows from the definition of $A \in \mathbb{R}^{\widetilde{n} \times \widetilde{n}}$, the second step follows from $\langle a_i, b_j \rangle \le d$, the third step follows from $\widetilde{d} = 2d$ (see Eq. (6)), and the last step follows from the definition of $\tau$ (see Eq. (5)). On the other hand, we know that for each $(i, j) \in [n] \times [n]$,
$$A_{i,j} \ge 0, \quad (11)$$
since it is the exponential of an entry of $QK^\top/\widetilde{d}$. Using Eq. (10) and Eq. (11), combined with our expression for $A$, it thus follows that
$$n\tau \le (A \mathbf{1}_{\widetilde{n}})_i \le 2n\tau \quad \text{for all } i \in [\widetilde{n}].$$
Since $D_{i,i} = (A \mathbf{1}_{\widetilde{n}})_i$, we thus know that
$$n\tau \le D_{i,i} \le 2n\tau \quad \text{for all } i \in [\widetilde{n}].$$
Choose the vector $v \in \mathbb{R}^{\widetilde{n}}$ defined as $v := (\mathbf{1}_n^\top, \mathbf{0}_n^\top)^\top$, i.e., $v_j = 1$ for $j \in [n]$ and $v_j = 0$ otherwise. We define $\widetilde{t}$ as
$$\widetilde{t} := \frac{1}{3} \cdot \exp(0.25 \beta (1 - t/d)) / (2n\tau). \quad (12)$$
We can show that $\widetilde{t} \ge \epsilon_a$ as follows:
$$\widetilde{t} = \frac{\exp(0.25\beta(1 - t/d))}{6n\tau} \ge \frac{\exp(0.25\beta(1 - t/d) - \beta)}{6n} = \frac{\exp(-0.75\beta - 0.25\beta t/d)}{6n} = \frac{\exp(-0.75\beta - 0.25\beta C_0/C)}{6n} = \frac{\exp(-0.75\beta - 0.25\beta C_0/C - \log n)}{6} = \frac{\exp(-0.75 C_b^2 \log n - 0.25 C_b^2 (C_0/C) \log n - \log n)}{6} \ge n^{-C_a} = \epsilon_a,$$
where the second step follows from $\tau = \exp(\beta/2) \le \exp(\beta)$, the fourth step follows from $t = C_0 \log n$ (Eq. (4)) and $d = C \log n$ (assumption in the Lemma statement), the sixth step follows from the choice of $\beta$ (Eq. (9)) and of $B$ (Eq. (7)), and the last step follows from the choice of $C_a$ (Eq. (8)).

Since $\widetilde{t} \ge \epsilon_a$, if we run an algorithm for Approximate Attention Computation (Definition 1.2), $\mathsf{AAttC}(\widetilde{n}, \widetilde{d}, B, \epsilon_a)$, where we pick $V$ to be the matrix whose first column is $v$ and whose remaining entries are $0$, we obtain (as the first column of the output) a vector $u \in \mathbb{R}^{\widetilde{n}}$ such that, for all $i \in [\widetilde{n}]$, $|u_i - (D^{-1}Av)_i| \le \widetilde{t}$.

Note that, using Remark 4.4, we have $\|a_i\|_2^2/d = 0.5$ for all $i \in [n]$ and $\|b_j\|_2^2/d = 0.5$ for all $j \in [n]$. Therefore, for any $(i, j) \in [n] \times [n]$,
$$\frac{1}{d}\langle a_i, b_j \rangle = \frac{1}{2d}\big(\|a_i\|_2^2 + \|b_j\|_2^2 - \|a_i - b_j\|_2^2\big) = \frac{1}{2d}\big(0.5d + 0.5d - \|a_i - b_j\|_2^2\big) = 0.5 - 0.5\|a_i - b_j\|_2^2/d,$$
where the second step follows from $\|a_i\|_2^2 = \|b_j\|_2^2 = d/2$, and the last step follows from simple algebra.

Recall that our goal is to determine, for each $i \in [n]$, whether there is a $j \in [n]$ such that $\|a_i - b_j\|_2^2 \le t$, or whether $\|a_i - b_j\|_2^2 \ge (1 + \epsilon)t$ for all $j \in [n]$. We will show next that we can distinguish these two cases by checking whether $u_i$ is greater than or less than the value $\widetilde{t}_0 := 2\widetilde{t}$.

If there exists a $j \in [n]$ such that $\|a_i - b_j\|_2^2 \le t$, then
$$\beta \langle a_i, b_j \rangle / \widetilde{d} = 0.5 \cdot \beta \langle a_i, b_j \rangle / d \ge 0.25 \cdot \beta (1 - t/d),$$
where the first step follows from $\widetilde{d} = 2d$ (see Eq. (6)). This means that
$$u_i \ge \exp(0.25\beta(1 - t/d))/(2n\tau) - \widetilde{t} = 3\widetilde{t} - \widetilde{t} = 2\widetilde{t} = \widetilde{t}_0,$$
where the first step uses $D_{i,i} \le 2n\tau$ together with the error guarantee on $u$, the second step follows from the definition of $\widetilde{t}$ (see Eq. (12)), and the last step follows from the definition of $\widetilde{t}_0$.

If instead $\|a_i - b_j\|_2^2 > (1 + \epsilon)t$ for all $j \in [n]$, this implies
$$\beta \langle a_i, b_j \rangle / \widetilde{d} \le 0.25 \beta (1 - (1 + \epsilon)t/d).$$
Then,
$$u_i \le \frac{n \exp(0.25\beta(1 - (1 + \epsilon)t/d))}{n\tau} + \widetilde{t} = \frac{\exp(0.25\beta(1 - t/d))}{2n\tau} \cdot \frac{2n}{\exp(0.25\beta\epsilon t/d)} + \widetilde{t} = 3\widetilde{t} \cdot \frac{2n}{\exp(0.25\beta\epsilon t/d)} + \widetilde{t} < 2\widetilde{t} = \widetilde{t}_0,$$
where the first step uses $D_{i,i} \ge n\tau$ together with the error guarantee on $u$, the second step is a rearrangement, the third step follows from the definition of $\widetilde{t}$ (see Eq. (12)), and the fourth step follows from the calculation in Eq. (13) below (which gives $\exp(0.25\beta\epsilon t/d) = n^{10} > 6n$ for $n \ge 2$) together with $\widetilde{t} = \widetilde{t}_0/2$.

Finally, by our choice of $\beta$ and $t$, we can see that
$$\exp(0.25\beta\epsilon t/d) = \exp(0.25\beta\epsilon C_0 \log n / d) = \exp(0.25\beta\epsilon C_0/C) = \exp(10 \log n) = n^{10}, \quad (13)$$
where the first step follows from $t = C_0 \log n$ (Eq. (4)), the second step follows from $d = C \log n$, and the third step follows from $\beta = B^2$ (Eq. (9)) and the choice of $B$ (Eq. (7)).

Thus, comparing each $u_i$ to the threshold $\widetilde{t}_0$ distinguishes the two cases of $\mathsf{GapANN}(n, d, t, \epsilon)$, which completes the proof.
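The Case 2 construction above is mechanical enough to sketch in code. In the sketch below, the exact (quadratic-time) attention computation stands in for the $\mathsf{AAttC}$ oracle, the inputs are assumed to already be balanced as in Remark 4.4 (half zeros, half ones), and the constant `Cb` and the function name are illustrative choices of ours rather than the values dictated by Eq. (7).

```python
import numpy as np

def gap_ann_via_attention(a_vecs, b_vecs, t, Cb=2.0):
    """Sketch of the Case-2 reduction in Lemma B.1: for each a_i, report whether some b_j
    appears to lie within (squared Hamming) distance roughly t, via one attention call."""
    n, d = a_vecs.shape
    beta = Cb ** 2 * np.log(n)                      # beta = B^2 with B = Cb * sqrt(log n)
    tau = np.exp(beta / 2)
    nt, dt = 2 * n, 2 * d                           # ntilde, dtilde

    # Q, K as in the proof: every entry is either sqrt(beta) or 0
    Q = np.sqrt(beta) * np.block([[a_vecs,           np.ones((n, d))],
                                  [np.zeros((n, d)), np.ones((n, d))]])
    K = np.sqrt(beta) * np.block([[b_vecs,           np.zeros((n, d))],
                                  [np.zeros((n, d)), np.ones((n, d))]])
    v = np.concatenate([np.ones(n), np.zeros(n)])   # v = (1_n, 0_n)

    A = np.exp(Q @ K.T / dt)                        # exact attention stands in for the AAttC oracle
    u = (A @ v) / A.sum(axis=1)                     # u ~ D^{-1} A v
    threshold = 2 * np.exp(0.25 * beta * (1 - t / d)) / (6 * n * tau)   # t~_0 = 2 t~
    return u[:n] >= threshold                       # True: some near neighbor plausible for a_i

# tiny smoke test with balanced binary vectors, as guaranteed by Remark 4.4
rng = np.random.default_rng(0)
n, d = 8, 6
halves = np.hstack([np.ones((n, d // 2)), np.zeros((n, d // 2))])
a_vecs = np.array([rng.permutation(row) for row in halves])
b_vecs = np.array([rng.permutation(row) for row in halves])
print(gap_ann_via_attention(a_vecs, b_vecs, t=2))
```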
Recent Followup Work. During the preparation of the camera-ready version, we became aware of a number of new works related to this paper [KMZ23, HJK+23, AS23, DLS23, GSWY23, DSZ23, GSY23, GSYZ23]. This work studies softmax attention; [KMZ23] shows how to speed up a polynomial-based attention unit via sketching techniques, and [HJK+23] introduces HyperAttention, which is able to handle long-context attention in near-linear time. While this work focuses on classical attention computation, [AS23] considers a more general tensor version of the attention scheme. This work studies inference computation; the recent works [DLS23, GSWY23] also consider how to speed up the training process, which involves the computation of the gradient and Hessian. Inspired by the hardness result of this work, [DSZ23] shows that there is a binary classification task (involving two datasets) that softmax attention can distinguish but linear attention cannot. [GSY23] provides a result for computing the attention matrix differentially privately, and [GSYZ23] introduces a quantum algorithm for computing the attention matrix.