Published as a conference paper at ICLR 2024

HYPERATTENTION: LONG-CONTEXT ATTENTION IN NEAR-LINEAR TIME

Insu Han (Yale University, insu.han@yale.edu), Rajesh Jayaram (Google Research, rkjayaram@google.com), Amin Karbasi (Yale University, Google Research, amin.karbasi@yale.edu), Vahab Mirrokni (Google Research, mirrokni@google.com), David P. Woodruff (CMU, Google Research, dwoodruf@cs.cmu.edu), Amir Zandieh (Independent Researcher, amir.zed512@gmail.com)

ABSTRACT

We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts used in Large Language Models (LLMs). Recent work suggests that in the worst-case scenario, quadratic time is necessary unless the entries of the attention matrix are bounded or the matrix has low stable rank. We introduce two parameters which measure: (1) the max column norm in the normalized attention matrix, and (2) the ratio of row norms in the unnormalized attention matrix after detecting and removing large entries. We use these fine-grained parameters to capture the hardness of the problem. Despite previous lower bounds, we are able to achieve a linear time sampling algorithm even when the matrix has unbounded entries or a large stable rank, provided the above parameters are small. HyperAttention features a modular design that easily accommodates integration of other fast low-level implementations, particularly FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods, giving significant speed improvements compared to state-of-the-art solutions like FlashAttention. We validate the empirical performance of HyperAttention on a variety of different long-context length datasets. For example, HyperAttention makes the inference time of ChatGLM3 50% faster on 32k context length while perplexity increases from 5.6 to 6.3. On larger context lengths, e.g., 131k, with causal masking, HyperAttention offers a 22-fold speedup on a single attention layer.

1 INTRODUCTION

Transformers (Vaswani et al., 2017) have been successfully applied to a wide variety of learning tasks in areas such as natural language processing (Devlin et al., 2018; Yang et al., 2019; Brown et al., 2020; Raffel et al., 2020), computer vision (Carion et al., 2020; Dosovitskiy et al., 2021), and time series forecasting (Zhou et al., 2021). Despite their success, these models face serious scalability limitations because naive exact computation of their attention layers incurs quadratic (in the sequence length) runtime and memory complexities. This presents a fundamental challenge for scaling transformer models to longer context lengths.

Various approaches have been explored to tackle the quadratic-time attention layer, with one notable direction focusing on approximating intermediate matrices in attention layers. Methods for doing this include approximations by sparse matrices (Kitaev et al., 2020; Daras et al., 2020; Roy et al., 2021; Sun et al., 2021; Ding et al., 2023; Han et al., 2023), low-rank matrices (Choromanski et al., 2021; Katharopoulos et al., 2020; Kacham et al., 2023), or a combination of both (Chen et al., 2021b; Zaheer et al., 2020; Chen et al., 2021a; Dass et al., 2022). These methods aim to provide faster approximation to various components of attention, but none of them provide end-to-end approximations of the full dot-product attention.
Moreover, none of these works support the use of causal masking, which is a crucial part of modern transformer architectures. In an ongoing work, Kacham et al. (2023) bypass the hardness of softmax by switching to polynomial functions and applying fast polynomial sketching methods (Ahle et al., 2020; Woodruff & Zandieh, 2020; 2022). On the negative side, recent theoretical bounds suggest that entry-wise approximations to the attention matrix are impossible in sub-quadratic time in general (Alman & Song, 2023).

Nevertheless, recently, KDEFormer (Zandieh et al., 2023) has provided a provable approximation in subquadratic time, under the assumption that the entries of the attention matrix are bounded. Theoretically, KDEFormer runs in roughly $O(n^{1.173})$ time; it employs kernel density estimation (KDE) to approximate column norms, allowing one to compute probabilities with which to sample columns of the attention matrix. However, the current algorithms for KDE are lacking practical efficiency (Charikar et al., 2020), and even in theory, there is a gap between the runtime of KDEFormer and the theoretically feasible $O(n)$ time algorithms. In (Alman & Song, 2023), the authors demonstrated that under the same assumption of bounded entries, a nearly linear time $O(n^{1+o(1)})$ algorithm is possible. However, their algorithm also involves using the polynomial method to approximate the softmax and is likely impractical (e.g., it was not empirically evaluated by the authors). In this work, we provide an algorithm which achieves the best of both worlds, being both a (1) practically efficient algorithm that (2) achieves the best possible near-linear time guarantee. Additionally, our approach supports causal masking, which was not possible via previous works.

1.1 PROBLEM STATEMENT

The dot-product attention (Vaswani et al., 2017) involves processing three input matrices: $Q$ (queries), $K$ (keys), $V$ (values), all of size $n \times d$, where $n$ is the number of tokens in the input sequence and $d$ is the dimension of latent representations. This process outputs the following:

$$\mathrm{Att} = D^{-1} A V.$$

Here, the matrix $A := \exp(QK^\top)$ is defined as the element-wise exponential of $QK^\top$. Additionally, $D$ is an $n \times n$ diagonal matrix derived from the row sums of $A$, i.e., $D_{i,i} = \|A_{i,:}\|_1$ for $i \in [n]$. In this context, the matrix $A$ is referred to as the "attention matrix" and $D^{-1}A$ is called the "softmax matrix". It is important to note that calculating the attention matrix $A$ directly requires $\Theta(n^2 d)$ operations, and storing it consumes $\Theta(n^2)$ memory. Consequently, a straightforward computation of $\mathrm{Att}$ demands a runtime of $\Omega(n^2 d)$ and $\Omega(n^2)$ memory.

Our objective is to efficiently approximate the output matrix $\mathrm{Att}$ while retaining its spectral properties. Our strategy involves designing an efficient estimator for the diagonal scaling matrix $D$ in near-linear time. Additionally, we aim to quickly approximate the matrix product of the softmax matrix $D^{-1}A$ and the value matrix $V$ through subsampling. To be more specific, our objective is to find a sampling matrix $S \in \mathbb{R}^{m \times n}$ with a limited number $m = n^{o(1)}$ of rows, along with a diagonal matrix $\widetilde{D} \in \mathbb{R}^{n \times n}$, such that the following bound on the operator norm of the error is met:

$$\left\| \mathrm{Att} - \widetilde{D}^{-1} A S^\top S V \right\|_{\mathrm{op}} \le \varepsilon \cdot \|D^{-1}A\|_{\mathrm{op}} \, \|V\|_{\mathrm{op}}. \quad (1)$$

1.2 OUR CONTRIBUTIONS

We show that efficiently solving the matrix multiplication component of the attention approximation problem in Eq. (1) can be achieved by defining the sampling matrix $S$ based on the row norms of $V$; a minimal sketch of this sampling step (together with the exact baseline it approximates) is given below.
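As a concrete reference point, the following is a minimal NumPy sketch, our own rather than the paper's implementation, of the exact quadratic-time computation $\mathrm{Att} = D^{-1}AV$ and of a sampling matrix $S$ drawn from the squared row norms of $V$; the function names and the row-max stabilization are our additions.

```python
import numpy as np

def exact_attention(Q, K, V):
    """Reference quadratic-time computation Att = D^{-1} A V with A = exp(Q K^T)."""
    logits = Q @ K.T
    logits -= logits.max(axis=1, keepdims=True)  # row-wise shift; leaves D^{-1} A unchanged
    A = np.exp(logits)
    return (A / A.sum(axis=1, keepdims=True)) @ V

def row_norm_sampling_matrix(V, m, seed=0):
    """S in R^{m x n}: row r picks index l_r with probability ||V_{l_r}||^2 / ||V||_F^2,
    scaled so that E[S^T S] = I and hence (X S^T)(S V) is an unbiased estimate of X V."""
    rng = np.random.default_rng(seed)
    n = V.shape[0]
    p = (V ** 2).sum(axis=1)
    p = p / p.sum()
    idx = rng.choice(n, size=m, p=p)
    S = np.zeros((m, n))
    S[np.arange(m), idx] = 1.0 / np.sqrt(m * p[idx])
    return S
```

With such an $S$, the subsampled product $\widetilde{D}^{-1} A S^\top (S V)$ only ever touches the $m$ sampled rows of $V$ and the corresponding columns of $A$, which is what makes the matrix-multiplication half of Eq. (1) cheap once $\widetilde{D}$ is in hand.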
The more challenging aspect lies in obtaining a reliable spectral approximation for the diagonal matrix $D$. In a recent result, Zandieh et al. (2023) effectively leverage fast KDE solvers to attain a high-quality approximation of $D$. However, we streamline the KDEFormer procedure and demonstrate that uniform sampling is sufficient to achieve the desired spectral guarantee, eliminating the need for importance sampling based on kernel densities. This significant simplification allows us to develop a practical and provably linear time algorithm. In contrast to prior work (Alman & Song, 2023; Zandieh et al., 2023), our approach does not necessitate bounded entries or bounded stable rank. Furthermore, the fine-grained parameters we introduce to analyze the time complexity may remain small even when the entries in the attention matrix or the stable rank are large.

Our work is inspired by the hard instance of Alman & Song (2023) for showing quadratic time lower bounds. Such instances have one randomly placed large entry in each row of the attention matrix. Our algorithm has an initial phase where we find large entries of the attention matrix in a black-box manner, such as by using Locality Sensitive Hashing (Kitaev et al., 2020), or a possibly learned CountSketch applied to the attention matrix (Charikar et al., 2002; Li et al., 2023a), or just a known heavy entry pattern (Chen et al., 2021a). We assume these procedures are fast, and that after removing the heavy entries, two parameters in the resulting attention matrix are small: (1) the max column $\ell_2$-norm, and (2) the ratio of row norms in the un-normalized attention matrix.

Prior work of Zandieh et al. (2023) used KDE to identify columns in the attention matrix with large norm and to perform an approximate matrix product with the value matrix by sampling such columns. As mentioned, finding such columns requires at least $O(n^{1.173})$ time. Instead, we observe that by doing a one-sided sampling from the squared row norms of $V$, we can avoid the use of KDEs and achieve the same spectral norm guarantee in terms of the stable rank. Although our algorithm is simple and just samples by the row norms of the value matrix (or even samples uniformly in practice), the main technical challenge is that we do not know the row norms of the attention matrix needed in order to normalize it and produce a proper factorization of it. This is reminiscent of the quadratic time hard instance of (Alman & Song, 2023), where we may not be able to find a heavy entry in a row easily, and thus cannot normalize by its norm in the attention matrix. Our parameters (1) and (2) above allow us to argue that the heavy entries, if they exist, are not distributed in the worst possible way.

Empirically, HyperAttention demonstrates significant speed improvements, achieving over a 50x acceleration in forward and backward propagation for sequence lengths of n = 131k. When dealing with causal masking, the method still delivers a substantial 5x speedup. Moreover, when our approach is applied to pretrained LLMs, e.g., chatglm3-6b-32k (Du et al., 2022), and evaluated on the long-context benchmark LongBench (Bai et al., 2023), it maintains performance levels that closely match those of the original models, even without the need for fine-tuning. Furthermore, we investigate task-specific evaluations and discover that summarization and code completion tasks are more robust to approximate attention layers than question answering.
2 PRELIMINARIES

We make use of the Hamming sorted LSH, a variant of angular locality-sensitive hashing introduced in the work by Zandieh et al. (2023). In this variant, the hash buckets are arranged in order of their Hamming distances. This LSH variant is particularly well-suited for designing GPU-friendly algorithms aimed at identifying dominant entries within the attention matrix $A$. In the context of Hamming sorted LSH, if we let $\mathcal{H} : \mathbb{R}^d \to [B]$ be a hash function with $B$ buckets drawn from an LSH family, then the collision probability $\Pr_{\mathcal{H}}[\mathcal{H}(q) = \mathcal{H}(k)]$ is roughly proportional to $\langle q, k \rangle$. A very useful property of this LSH variant is that its buckets are ordered in such a way that geometrically close points are hashed into consecutive buckets. We provide the following definition.

Definition 1 (Hamming sorted LSH, Definition 7.3 of (Zandieh et al., 2023)). For a positive integer $r$, there exists an LSH function $\mathcal{H} : \mathbb{R}^d \to [2^r]$ such that for any $x, y \in \mathbb{R}^d$ its collision probability is $\Pr[\mathcal{H}(x) = \mathcal{H}(y)] = \left(1 - \frac{\theta(x,y)}{\pi}\right)^r$, where $\theta(x, y) := \cos^{-1}\left(\frac{\langle x, y \rangle}{\|x\| \|y\|}\right)$. Furthermore, this LSH function hashes similar points to adjacent buckets. Specifically, the probability that two points end up in adjacent buckets is $\Pr[\mathcal{H}(x) = \mathcal{H}(y) \pm 1 \ (\mathrm{mod}\ 2^r)] = \frac{2\theta(x,y)}{\pi} \left(1 - \frac{\theta(x,y)}{\pi}\right)^{r-1}$.

Using this LSH function, as demonstrated by Zandieh et al. (2023), we can sort keys and queries within an attention layer in such a way that large entries get shifted towards the diagonal of the attention matrix. Subsequently, these significant entries in the attention matrix can be captured by computing equal-sized blocks along the diagonal. This approach aligns with the block-memory access patterns of modern hardware and can be efficiently parallelized through batching across blocks.

Figure 1: How sortLSH finds large entries of $A$: (Left) Keys and queries undergo hashing using the Hamming sorted LSH $\mathcal{H}(\cdot)$. (Middle) Keys and queries are rearranged based on their hash buckets; the attention matrix after applying these row and column permutations is denoted $A_{P_Q, P_K}$, and its large entries are concentrated around the diagonal blocks. (Right) The row and column permutations are reversed on the attention matrix and $M^H \odot A$ is highlighted.

Algorithm 1: sortLSH for locating large entries of $A$
1: input: matrices $Q, K \in \mathbb{R}^{n \times d}$, and block size $b$
2: Let $\mathcal{H}(\cdot)$ be a Hamming sorted LSH as per Definition 1 and hash the rows of $Q, K$
3: Let $P_K, P_Q \in \mathrm{Sym}(n)$ be permutations satisfying $P_K(i) < P_K(j)$ if $\mathcal{H}(K_{i,:}) < \mathcal{H}(K_{j,:})$ and $P_Q(i) < P_Q(j)$ if $\mathcal{H}(Q_{i,:}) < \mathcal{H}(Q_{j,:})$
4: return mask matrix $M^H \in \{0,1\}^{n \times n}$ defined as $M^H_{i,j} = \mathbf{1}\{\lfloor P_Q(i)/b \rfloor = \lfloor P_K(j)/b \rfloor\}$

3 ALGORITHM

To obtain a spectral guarantee when approximating $\mathrm{Att}$, our initial step involves producing a $(1 \pm \varepsilon)$ approximation of the diagonal entries in the matrix $D$. Subsequently, we approximate the matrix product between $D^{-1}A$ and $V$ via sampling according to the squared row $\ell_2$-norms of $V$.

Estimating $D$. Our procedure for approximating $D$ consists of two steps. Initially, we identify the dominant entries within the attention matrix using an algorithm rooted in the Hamming sorted LSH, as defined in Definition 1 (a toy sketch of this masking step is given below). The second step revolves around randomly selecting a small subset of keys $K$. We will demonstrate that under certain mild assumptions about the matrices $A$ and $D$, this simple approach allows us to establish spectral bounds on the estimated matrix.
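To illustrate the shape of Algorithm 1's output, here is a toy NumPy sketch of the block mask $M^H$, with two hedges: the hash is a plain SimHash over random hyperplanes, a stand-in for the Hamming sorted LSH of Definition 1 that does not guarantee adjacent buckets for similar points, and the dense $n \times n$ mask is materialized only for clarity, whereas a real implementation keeps only the block structure.

```python
import numpy as np

def sortlsh_mask(Q, K, b, r=4, seed=0):
    """Toy block mask: hash queries and keys with one shared SimHash, sort by bucket id,
    and flag M^H[i, j] = 1 iff query i and key j land in the same size-b block."""
    rng = np.random.default_rng(seed)
    nq, nk, d = Q.shape[0], K.shape[0], Q.shape[1]
    planes = rng.standard_normal((d, r))                          # shared hash for Q and K
    hq = (Q @ planes > 0).astype(np.int64) @ (2 ** np.arange(r))  # bucket ids in [0, 2^r)
    hk = (K @ planes > 0).astype(np.int64) @ (2 ** np.arange(r))
    pos_q = np.empty(nq, dtype=np.int64)
    pos_k = np.empty(nk, dtype=np.int64)
    pos_q[np.argsort(hq, kind="stable")] = np.arange(nq)          # P_Q: position after sort
    pos_k[np.argsort(hk, kind="stable")] = np.arange(nk)          # P_K: position after sort
    return ((pos_q[:, None] // b) == (pos_k[None, :] // b)).astype(np.uint8)
```

The block structure, equal-sized groups along the permuted diagonal, is what makes the masked entries cheap to evaluate with batched small matrix multiplications.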
Our aim is to find a sufficiently precise approximate matrix $\widetilde{D}$ that satisfies:

$$\left\| \left( \widetilde{D}^{-1} - D^{-1} \right) A \right\|_{\mathrm{op}} \le \varepsilon \cdot \|D^{-1}A\|_{\mathrm{op}}. \quad (2)$$

Our assumption is that the column norms of the softmax matrix exhibit a relatively uniform distribution. To be more precise, we assume that for any $i \in [n]$ there exists some $\alpha = n^{o(1)}$ such that $\|D^{-1}A e^{(i)}\|_2^2 \le \frac{\alpha}{n}$. It is worth noting that our assumption is more general than the bounded input entries assumption made in (Alman & Song, 2023). In fact, if their assumption holds, it implies that $\|D^{-1}A e^{(i)}\|_2^2 \le \frac{n^{o(1)}}{n}$ for all $i \in [n]$. In Section 4.3, we empirically compute $\alpha$ as the maximum of the squared $\ell_2$-norms of the columns in $D^{-1}A$ and verify that it is indeed sublinear in $n$.

The first step of our empirical algorithm involves identifying large entries of the attention matrix $A$ through hashing keys and queries into uniformly-sized buckets using the Hamming sorted LSH, which we refer to as sortLSH. This process is detailed in Algorithm 1 and is visually illustrated in Fig. 1. Note that we also mention other ways of identifying large entries, such as checking for a known heavy hitter pattern, or using CountSketch, which we describe more below.

Algorithm 1 returns a sparse mask designed to isolate the dominant entries of the attention matrix. Given this mask, we compute an approximation of the matrix $D$ in Algorithm 2 that satisfies the spectral guarantee in Eq. (2). This algorithm accomplishes this by combining the attention values corresponding to the mask with a randomly chosen subset of columns from the attention matrix.

Algorithm 2: ApproxD for estimating the diagonal matrix $D$
1: input: matrices $Q, K \in \mathbb{R}^{n \times d}$, large entries mask $M^H \in \{0,1\}^{n \times n}$, parameters $\kappa > 0$, $\varepsilon > \frac{1}{\kappa^4}$, $\alpha > \varepsilon^2 \kappa$, and integer $m$
2: Randomly choose a subset $T \subseteq [n]$ with cardinality $|T| = m$
3: $\tau \leftarrow \max_{j \in T} \langle \mathbf{1} - M^H_{j,:}, \exp(K Q_{j,:}^\top) \rangle$  {estimate of the maximum un-masked row sum}
4: Generate $m$ i.i.d. samples $\ell_1, \ell_2, \ldots, \ell_m \sim \mathrm{Unif}([n])$
5: for $i \in [n]$ do
6: $\quad C_i \leftarrow \Theta\left( \frac{\varepsilon^2 m}{n \log n} \right) \cdot \left( \langle M^H_{i,:}, \exp(K Q_{i,:}^\top) \rangle + \tau/\kappa \right)$  {(capped) row sum of masked entries; $\widetilde{d}_i$ below is the (capped) row-sum estimator for the un-masked entries}
7: $\quad \widetilde{d}_i \leftarrow \frac{n}{m} \sum_{j \in [m]} (1 - M^H_{i,\ell_j}) \cdot \min\left( \exp\langle Q_{i,:}, K_{\ell_j,:} \rangle, \, C_i \right)$
8: $\quad d_i \leftarrow \langle M^H_{i,:}, \exp(K Q_{i,:}^\top) \rangle + \max(\widetilde{d}_i, \tau/\kappa)$  {full row-sum estimate}
9: return diagonal matrix $\widetilde{D} = \mathrm{diag}(\{d_i\}_{i=1}^n)$

Algorithm 3: HyperAttention: attention mechanism in near-linear time
1: input: matrices $Q, K, V \in \mathbb{R}^{n \times d}$, mask matrix $M^H \in \{0,1\}^{n \times n}$, and parameter $\varepsilon > \frac{1}{n^{o(1)}}$
2: Run Algorithm 2 and let $\widetilde{D} \leftarrow \textsc{ApproxD}(Q, K, M^H, n^{o(1)}, \varepsilon, n^{o(1)}, d \cdot n^{o(1)})$
3: Let $S \in \mathbb{R}^{m \times n}$ be an i.i.d. sampling matrix based on the squared row norms of $V$, as in Lemma 2
4: return $\widetilde{D}$ and $S$

The assumptions of Lemma 1 are used to ensure that the variance of the estimator is small, and the sample complexity of the algorithm increases as a function of the parameters $\alpha, \kappa$. We remark that our algorithm is versatile and can function effectively with a predefined mask that specifies the positions of dominant entries within the attention matrix, mirroring the approach taken in (Chen et al., 2021a). The main guarantee provided by this algorithm is given in Lemma 1.

Lemma 1 (Approximating $D$). For any $Q, K \in \mathbb{R}^{n \times d}$, let $A = \exp(QK^\top)$. Also let $D \in \mathbb{R}^{n \times n}$ be the diagonal matrix with $D_{i,i} = \|A_{i,:}\|_1$. Additionally, suppose that $\alpha = n \cdot \max_{i \in [n]} \|D^{-1}A e^{(i)}\|_2^2$. For any mask matrix $M^H \in \{0,1\}^{n \times n}$, define the condition number $\kappa := \frac{\max_{i \in [n]} \langle \mathbf{1} - M^H_{i,:}, A_{i,:} \rangle}{\min_{j \in [n]} \langle \mathbf{1} - M^H_{j,:}, A_{j,:} \rangle}$. If $m = \Omega\left( \frac{\kappa^7 \alpha^2}{\varepsilon^6} \log n \right)$, the output $\widetilde{D}$ of Algorithm 2 satisfies Eq. (2) with probability at least $1 - \frac{1}{\mathrm{poly}(n)}$.
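The estimator behind Algorithm 2 is straightforward to prototype. The sketch below, our own simplification, sums the masked entries exactly and estimates the remaining row mass from $m$ uniformly sampled keys; it omits the capping at $C_i$ and the $\tau/\kappa$ floor, so it illustrates the shape of the estimator but carries none of Lemma 1's guarantees.

```python
import numpy as np

def approx_row_sums(Q, K, mask, m, seed=0):
    """Estimate D_ii = sum_j exp(<Q_i, K_j>): exact on entries flagged by the mask M^H,
    uniform sampling on the rest (simplified Algorithm 2, without capping or flooring)."""
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    # exact contribution of the masked (presumed large) entries, row by row
    masked_sum = np.array([np.exp(Q[i] @ K[mask[i] == 1].T).sum() for i in range(n)])
    # unbiased estimate of the remaining row mass from m uniform key indices
    idx = rng.integers(0, n, size=m)
    sampled = np.exp(Q @ K[idx].T)             # the n x m sub-block A[:, idx]
    unmasked = (1 - mask[:, idx]) * sampled    # drop columns already counted above
    return masked_sum + (n / m) * unmasked.sum(axis=1)
```

Combined with the sampling matrix from the earlier sketch, this is already the shape of Algorithm 3: return $\widetilde{D}$ and $S$, and approximate $\mathrm{Att}$ by $\widetilde{D}^{-1} (A S^\top)(S V)$, where only the sampled columns of $A$ are ever formed.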
Approximating the product of the softmax matrix $D^{-1}A$ and the values matrix $V$. Given a $\widetilde{D}$ that meets the spectral approximation condition in Eq. (2), we can achieve the spectral constraint in Eq. (1) by finding a sampling matrix that satisfies the following condition:

$$\left\| \widetilde{D}^{-1} A S^\top S V - \widetilde{D}^{-1} A V \right\|_{\mathrm{op}} \le \frac{\varepsilon}{2} \cdot \|D^{-1}A\|_{\mathrm{op}} \, \|V\|_{\mathrm{op}}. \quad (3)$$

We can efficiently find a sampling matrix $S \in \mathbb{R}^{m \times n}$ with a small number $m$ of rows that satisfies Eq. (3) by leveraging well-established techniques in Approximate Matrix Multiplication (AMM).

Lemma 2. For any matrices $\widetilde{D}, A \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{n \times d}$, consider a sampling matrix $S \in \mathbb{R}^{m \times n}$ constructed as follows: first generate $m$ i.i.d. samples $\ell_1, \ldots, \ell_m \in [n]$ according to the squared row norms of the matrix $V$, i.e., with probabilities $\frac{\|V_{i,:}\|_2^2}{\|V\|_F^2}$, then let the $r$-th row of $S$ be $\frac{\|V\|_F}{\sqrt{m}\, \|V_{\ell_r,:}\|_2} \cdot e^{(\ell_r)\top}$. If $m = \Omega\left( \varepsilon^{-2} d \cdot \mathrm{srank}(\widetilde{D}^{-1}A) \right)$ for some $\varepsilon > 0$, the following holds with probability at least 0.99:

$$\left\| \widetilde{D}^{-1} A S^\top S V - \widetilde{D}^{-1} A V \right\|_{\mathrm{op}} \le \varepsilon \cdot \|\widetilde{D}^{-1}A\|_{\mathrm{op}} \, \|V\|_{\mathrm{op}}.$$

The above result is standard; for a proof refer to (Drineas & Kannan, 2001).

Main Theorem. Now we can integrate the subroutines for approximating the diagonal $\widetilde{D}$ and approximating the matrix product between $\widetilde{D}^{-1}A$ and the values matrix $V$. With this, we introduce HyperAttention, an efficient algorithm that can approximate the attention mechanism with spectral guarantees as per Eq. (1) in near-linear time. Our Algorithm 3 takes as input a mask $M^H$ that defines the positions of dominant entries within the attention matrix. This mask can be generated using the sortLSH algorithm (Algorithm 1), or it can be a predefined mask similar to the approach taken in (Chen et al., 2021a). The large entries mask $M^H$ is assumed to be sparse by design, and its number of nonzero entries is bounded by $\mathrm{nnz}(M^H) = n^{1+o(1)}$. We now introduce our main theorem, which will be proved in Appendix A.

Theorem 1 (HyperAttention guarantee). For any matrices $Q, K, V \in \mathbb{R}^{n \times d}$, any mask matrix $M^H \in \{0,1\}^{n \times n}$, and parameter $\varepsilon > \frac{1}{n^{o(1)}}$, let $A = \exp(QK^\top)$ and let $D \in \mathbb{R}^{n \times n}$ be the diagonal matrix with $D_{i,i} = \|A_{i,:}\|_1$. If $\max_{i \in [n]} \|D^{-1}A e^{(i)}\|_2^2 \le \frac{n^{o(1)}}{n}$ and $\frac{\max_{i \in [n]} \langle \mathbf{1} - M^H_{i,:}, A_{i,:} \rangle}{\min_{j \in [n]} \langle \mathbf{1} - M^H_{j,:}, A_{j,:} \rangle} \le n^{o(1)}$, then with probability at least 0.98 the outputs $S, \widetilde{D}$ of Algorithm 3 satisfy the spectral condition in Eq. (1). Moreover, this algorithm's runtime is $O(d \cdot n^{1+o(1)} + d \cdot \mathrm{nnz}(M^H))$.

Note that even if $M^H$ is not given to us, the theorem holds as long as $M^H$ can be found in $d \cdot n^{1+o(1)}$ time. We also give examples of when this is possible: by using the Hamming sorted LSH, which our experiments are based on, or by using the Expander Sketch of (Larsen et al., 2016), which is based on CountSketch (Charikar et al., 2002) but also gives a fast recovery time. In the supplementary material we show:

Corollary 1 (HyperAttention with sortLSH). Suppose all preconditions of Theorem 1 hold. Further, suppose the mask matrix $M^H \in \{0,1\}^{n \times n}$ is generated as in Algorithm 1 with block size $b = n^{o(1)}$ and $r = \log_2 n$ in Definition 1. We further assume there are at most $n^{1+o(1)}$ pairs $(i, j)$ with $\theta(Q_{i,:}, K_{j,:}) < \frac{\pi}{2}(1 - o(1))$, where $\theta$ is as in Definition 1. Then with probability $1 - 1/n^{o(1)}$, the $M^H$ we find in Algorithm 1 has at most $n^{1+o(1)}$ non-zero entries and, with probability at least 0.98, the outputs $S, \widetilde{D}$ of Algorithm 3 satisfy Eq. (1), and the overall runtime is $O(d \cdot n^{1+o(1)})$.

We note the assumption on the angles of the rows of $Q$ and $K$ in Corollary 1 is satisfied if most rows are drawn uniformly at random from a $d$-dimensional sphere, since in this case they will be nearly orthogonal, i.e., have angle $\frac{\pi}{2}(1 \pm o(1))$, with high probability.
However, the corollary also allows $n^{1+o(1)}$ pairs of rows to have arbitrary angle, which may be more realistic.

Corollary 2 (HyperAttention with Expander Sketch). Suppose all preconditions of Theorem 1 hold. Further, suppose the mask matrix $M^H \in \{0,1\}^{n \times n}$ is defined such that there is a threshold $\tau = n^{o(1)}$ for which $M^H_{i,j} = 1$ if and only if $(QK^\top)_{i,j}^2 \ge \frac{\|QK^\top e_j\|_2^2}{\tau}$. Then we can find $M^H$ exactly with probability $1 - O(1/n^2)$, and with probability at least 0.98 the outputs $S, \widetilde{D}$ of Algorithm 3 satisfy Eq. (1). The runtime is $O(d \cdot n^{1+o(1)})$.

The key idea behind the proof of Corollary 2 is to first sketch $Q$ by an Expander Sketch $T$, which is efficient since $T$ has a small number of rows, and then compute $(TQ)K^\top$, which is again efficient since $TQ$ has a small number of rows. Thus, we never form the matrix $QK^\top$.

3.1 CAUSAL MASKING

Language models commonly employ causal masking. The causal mask is a lower triangular binary square matrix denoted by $M^C$, where $M^C_{i,j} = \mathbf{1}\{i \ge j\}$. The causal attention mechanism is defined as:

$$\mathrm{Att}_C = D_C^{-1} (M^C \odot A) V,$$

where $A := \exp(QK^\top)$ is defined as before and $D_C$ is an $n \times n$ diagonal matrix derived from the row sums of the masked attention $M^C \odot A$, specifically $[D_C]_{i,i} = \langle M^C_{i,:}, A_{i,:} \rangle$ for $i \in [n]$.

To approximate causal attention with a spectral guarantee, we require two components. First, we need a spectral approximation for the diagonal matrix $D_C$. Second, we need to approximate the matrix product between $D_C^{-1}(M^C \odot A)$ and $V$, which can be achieved using the same sampling technique as described in Algorithm 3 and Lemma 2. The first component is more intricate, and we employ a recursive method to address it. So we focus on how to efficiently approximate the diagonal $D_C$.

Our approach is based on a key observation, as depicted in Fig. 2. The masked attention $M^C \odot A$ can be decomposed into three non-zero blocks, each of which has half the size of the original attention matrix. The block $A_{21}$, located entirely below the diagonal, is unmasked attention. Consequently, we can approximate its row sums using Algorithm 2. The two diagonal blocks $M^{C_1} \odot A_{11}$ and $M^{C_2} \odot A_{22}$ shown in Fig. 2 are causal attentions of half the original size. To handle these, we apply a recursive approach: we further partition them into smaller blocks and repeat this procedure. We present pseudocode for this procedure in Algorithm 4.

Figure 2: The causal attention matrix can be divided into three equal-sized non-zero sections: $M^{C_1} \odot A_{11}$ and $M^{C_2} \odot A_{22}$ are both causal attention matrices, and $A_{21}$ is an unmasked attention matrix.

Algorithm 4: CausalApproxD, recursive approximation of $D_C$ for causal masking
1: input: matrices $Q, K \in \mathbb{R}^{n \times d}$
2: Split $Q$ and $K$ into equal-sized sub-matrices: $Q = [Q_1^\top, Q_2^\top]^\top$ and $K = [K_1^\top, K_2^\top]^\top$
3: $\widetilde{D}_{C_{11}} \leftarrow \textsc{CausalApproxD}(Q_1, K_1)$ and $\widetilde{D}_{C_{22}} \leftarrow \textsc{CausalApproxD}(Q_2, K_2)$
4: Run the unmasked algorithm ApproxD (Algorithm 2) on $Q_2, K_1$ to get $\widetilde{D}_{21}$
5: return $\widetilde{D}_C = \begin{pmatrix} \widetilde{D}_{C_{11}} & 0 \\ 0 & \widetilde{D}_{21} + \widetilde{D}_{C_{22}} \end{pmatrix}$
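To make the recursion concrete, here is a compact NumPy sketch of the idea behind Algorithm 4, again our own simplification: the off-diagonal block $A_{21}$ is summed exactly instead of being estimated with Algorithm 2, and the recursion bottoms out in an exact computation once the blocks are small.

```python
import numpy as np

def causal_row_sums(Q, K, base=256):
    """Recursive estimate of [D_C]_ii = sum_{j <= i} exp(<Q_i, K_j>) following Fig. 2:
    two half-size causal blocks plus one unmasked block A_21 below the diagonal."""
    n = Q.shape[0]
    if n <= base:                                    # small block: compute the causal sums exactly
        return np.tril(np.exp(Q @ K.T)).sum(axis=1)
    h = n // 2
    d_top = causal_row_sums(Q[:h], K[:h], base)      # causal block M^{C_1} o A_11
    d_bot = causal_row_sums(Q[h:], K[h:], base)      # causal block M^{C_2} o A_22
    # unmasked block A_21 = exp(Q_2 K_1^T); the real algorithm estimates this row sum
    # with ApproxD, here it is computed exactly for clarity
    d_21 = np.exp(Q[h:] @ K[:h].T).sum(axis=1)
    return np.concatenate([d_top, d_21 + d_bot])
```

Replacing the exact row sums of $A_{21}$ with the sampled estimator of Algorithm 2 (e.g., the `approx_row_sums` sketch above, whose name is ours) is what keeps the overall cost near-linear; the recursion itself only adds a logarithmic number of levels.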
4 EXPERIMENTS

In this section, we benchmark our algorithms by scaling up existing large language models to handle long-range sequences. All experiments are performed on a single A100 GPU with 40 GB memory, and we use FlashAttention-2 (Dao, 2023) for the exact attention computation.

Implementation Detail. We implement HyperAttention based on sortLSH and uniform column sampling. Specifically, we first apply sortLSH to all rows in $Q, K \in \mathbb{R}^{n \times d}$. Then, each set of rows is partitioned into groups of size $b$, where $b$ is the block size as in Fig. 1. The $i$-th group of rows in $Q$ is multiplied by the corresponding group in $K$, resulting in a block-diagonal approximation of $A_{P_Q, P_K}$. Next, we optimize Algorithm 2 for approximating $D$ by sharing the random indices $\{\ell_j\}_{j=1}^m$ across all rows in $Q$. This corresponds to uniformly sampling $m$ rows in $K$. To further simplify, we reuse the indices $\{\ell_j\}_{j=1}^m$ for the Approximate Matrix Multiplication (AMM) in Lemma 2. The required operations involve permuting $n$ rows, reshaping tensors, and small matrix multiplications. Since every batch, head, and block has the same configuration, the implementation can be parallelized on GPUs.

4.1 MONKEY PATCHING SELF-ATTENTION

We first evaluate HyperAttention on pre-trained LLMs. We choose two models with different architectures that are widely used in practical applications: chatglm3-6b-32k (Du et al., 2022) and phi-1.5 (Li et al., 2023b). We patch their final $\ell$ attention layers by replacing them with HyperAttention, where $\ell$ can vary from 0 to the number of all attention layers in each LLM. Note that attention in both models requires causal masking, and we make use of Algorithm 4 by recursively applying it until the input sequence lengths $n$ are less than 4,096. We set both the bucket size $b$ and the number of sampled columns $m$ to 256 for all sequence lengths. We evaluate the performance of such monkey patched models in terms of perplexity and speedup.

We use LongBench (Bai et al., 2023), a collection of long-context benchmark datasets, which contains 6 different tasks ranging from single and multiple-document question answering, summarization, few-shot learning, and synthetic tasks, to code completion. We select a subset of the datasets whose encoded sequence lengths are larger than 32,768 and trim them if the length is over 32,768, so that all data have sequence lengths of 32,768. Then, we compute the perplexity (i.e., loss on next-token prediction) of each model. To highlight the scalability on long sequences, we calculate the total speedup on all attention layers, whether performed by HyperAttention or FlashAttention. The results are summarized in Fig. 3. Observe that chatglm3-6b-32k shows a reasonable perplexity even after being monkey patched with HyperAttention; e.g., after replacing 20 layers the perplexity increases by approximately 1, and it goes up slowly until 24 layers, while runtimes in the attention layers improve by about 50%. If all the layers are replaced, the perplexity goes up to 12, but the model runs about 2.3x faster. For phi-1.5, a similar trend holds, but the perplexities increase linearly as the number of replaced HyperAttention layers grows.

Figure 3: Perplexity and speedup of chatglm3-6b-32k (left) and phi-1.5 (right) monkey patched with HyperAttention, as a function of the number of replaced attention layers (replaced starting from the final layers).

Table 1: Performance evaluation of chatglm3-6b-32k equipped with HyperAttention on LongBench datasets (Bai et al., 2023). They contain 6 different tasks and we evaluate each of them with its own metric, where a higher value indicates better performance.

| Number of Replaced Layers | single-qa | multi-qa | summarization | few-shot | synthetic | code |
|---|---|---|---|---|---|---|
| 0 (exact) | 96.07 | 88.19 | 60.29 | 203.39 | 103.00 | 107.16 |
| 7 | 91.82 | 90.02 | 60.84 | 202.77 | 104.00 | 107.42 |
| 14 | 87.98 | 89.64 | 58.84 | 200.46 | 101.00 | 103.19 |
| 21 | 84.61 | 77.51 | 61.55 | 201.13 | 97.00 | 101.12 |
| 28 | 53.30 | 73.98 | 60.45 | 200.01 | 71.00 | 104.99 |
In addition, we evaluate the performance of monkey patched chatglm3-6b-32k on LongBench datasets and compute task-specific evaluation scores on each task, including single-document question answering, multiple-document question answering, summarization, few-shot learning, synthetic tasks, and code completion. Results are provided in Table 1. While replacing attention with HyperAttention generally leads to performance degradation, we observe that its impact can vary depending on the task at hand. For example, summarization and code completion are more robust than other tasks. Notably, when half of all attention layers are patched (i.e., 14 layers), we verify that most of the tasks do not degrade by more than 13%. In particular, the performance of the summarization task remained almost unchanged, suggesting that this task may be more robust to partial alterations in the attention mechanism. We recall that computations in attention layers can be 1.5x faster when n = 32k.

4.2 SINGLE SELF-ATTENTION LAYER

We further explore the speedup of HyperAttention with varying sequence lengths from 4,096 to 131,072. We measure wall-clock times of both forward and forward+backward operations when they are computed with FlashAttention or are accelerated by HyperAttention. We measure the times with and without causal masking. All inputs $Q, K, V$ have the same length, their dimension is fixed to $d = 64$, and the number of attention heads is set to 12. We choose the same parameters in HyperAttention as described in the previous section.

Figure 4: Speedup of HyperAttention (this work) over exact computation with FlashAttention (Dao, 2023) in a single self-attention layer during forward and backward operations, (a) without causal masking and (b) with causal masking, for sequence lengths n from 4k to 131k. For n = 131k, HyperAttention runs up to 54x faster without causal masking and 5.4x faster with causal masking.

In Fig. 4, we observe that HyperAttention runs up to 54x faster without causal masking and 5.4x faster when causal masking is applied. Although the time complexities with and without causal masking are the same, the practical algorithm for causal masking (Algorithm 4) requires additional operations such as partitioning $Q, K, V$ and merging attention outputs, which increase the practical runtime. However, these speedups grow as the sequence length $n$ grows. We believe this opens the door to scaling self-attention not only for inference but also for training or fine-tuning LLMs to fit significantly longer sequences.

4.3 EMPIRICAL VERIFICATION OF THE ASSUMPTION

In addition, we empirically verify our assumption in Theorem 1, i.e., that the squared $\ell_2$-norm of the columns in $D^{-1}A$ is upper bounded by $\frac{n^{o(1)}}{n}$. We investigate pretrained transformers with and without causal masking. For non-causal-masking attention, we use the T2T-ViT model (Yuan et al., 2021) and take $Q, K, V$ from its first attention layer on the ImageNet test dataset, as it is the main computational bottleneck. For each image, we compute $\alpha$ as the largest of the squared $\ell_2$-norms of the columns in $D^{-1}A$, i.e., $\max_i \|D^{-1}A e^{(i)}\|_2^2$, and collect the value over 50k images. The sequence length of the model is given by n = 3,136 and the averaged value of $\alpha$ is observed to be 8.1801. This is much smaller than $n$ and can possibly be sublinear in $n$.
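The quantity being measured admits a short exact computation for moderate $n$. The sketch below, ours, returns the largest squared $\ell_2$-norm of the columns of $D^{-1}A$ (Lemma 1's $\alpha$ equals $n$ times this value); the optional `exclude_first` argument mirrors the exclusion of the first 32 columns used for the language model experiment described next.

```python
import numpy as np

def max_sq_column_norm(Q, K, exclude_first=0):
    """Largest squared l2-norm of the columns of the softmax matrix D^{-1} A,
    computed exactly (only feasible for moderate n)."""
    logits = Q @ K.T
    logits -= logits.max(axis=1, keepdims=True)   # row-wise shift leaves D^{-1} A unchanged
    A = np.exp(logits)
    P = A / A.sum(axis=1, keepdims=True)          # softmax matrix D^{-1} A
    col_sq = (P ** 2).sum(axis=0)                 # ||D^{-1} A e^{(i)}||_2^2 for every column i
    return col_sq[exclude_first:].max()
```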
To further investigate the dependence on $n$, we utilize chatglm3-6b-32k and the LongBench narrative-qa dataset, changing the sequence length $n$ from 1k to 9k. We trim or pad the input context so that its length is exactly $n$. Unlike the vision model, we notice that the first columns in $D^{-1}A$ often contain heavy entries; hence we compute $\alpha$ as the largest squared column norm excluding the first 32 columns. We collect these values over all heads and layers and compute their average. Fig. 5 plots the value of $\alpha/n$ for various sequence lengths $n$. It is observed that this value decreases as $n$ grows, supporting the claim that our assumption $\alpha = n^{o(1)}$ holds in practice.

Figure 5: Empirical values of $\alpha$ (i.e., the largest squared $\ell_2$-norm of the columns in $D^{-1}A$), plotted as $\alpha/n$ for sequence lengths from 1k to 9k. Theorem 1 assumes $\alpha = n^{o(1)}$, and the plot supports the claim that this assumption holds in practice.

5 CONCLUSION

In this work, we propose a simple linear time attention approximation algorithm by simplifying the existing algorithm based on kernel density estimation (KDE). We introduce a more general parameterization for a spectral approximation guarantee based on the condition number, which does not require the assumptions used in prior work. Our algorithm makes use of sortLSH to find large entries, and we adopt fast matrix multiplication via row norm sampling. We additionally study how our algorithm can be used for causal masking by recursive partitioning. Empirically, we illustrate that pre-trained LLMs using our algorithm can enhance both inference and training speeds with only minimal performance degradation.

ACKNOWLEDGEMENT

D. Woodruff worked on this while at Google Research and also while visiting the Simons Institute for the Theory of Computing.

REFERENCES

Thomas D Ahle, Michael Kapralov, Jakob BT Knudsen, Rasmus Pagh, Ameya Velingker, David P Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Symposium on Discrete Algorithms (SODA), 2020.

Josh Alman and Zhao Song. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding, 2023.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Neural Information Processing Systems (NeurIPS), 2020.

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming (ICALP), 2002.

Moses Charikar, Michael Kapralov, Navid Nouri, and Paris Siminelakis. Kernel density estimation through density constrained near neighbor search. In Foundations of Computer Science (FOCS), 2020.

Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, and Christopher Ré. Pixelated Butterfly: Simple and efficient sparse training for neural network models. In International Conference on Learning Representations (ICLR), 2021a.
Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. Neural Information Processing Systems (NeurIPS), 2021b.

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with Performers. In International Conference on Learning Representations (ICLR), 2021.

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023.

Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. SMYRF: Efficient attention using asymmetric clustering. Neural Information Processing Systems (NeurIPS), 2020.

Jyotikrishna Dass, Shang Wu, Huihong Shi, Chaojian Li, Zhifan Ye, Zhongfeng Wang, and Yingyan Lin. ViTALiTy: Unifying low-rank and sparse approximation for vision transformer acceleration with a linear Taylor attention. arXiv preprint arXiv:2211.05109, 2022.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018.

Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, and Furu Wei. LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

Petros Drineas and Ravi Kannan. Fast Monte-Carlo algorithms for approximate matrix multiplication. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pp. 452-459. IEEE, 2001.

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. 2022.

Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. LM-Infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023.

Praneeth Kacham, Vahab Mirrokni, and Peilin Zhong. PolySketchFormer: Fast transformers via sketches for polynomial kernels. arXiv preprint arXiv:2310.01655, 2023.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning (ICML), 2020.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient Transformer. In International Conference on Learning Representations (ICLR), 2020.

Kasper Green Larsen, Jelani Nelson, Huy L. Nguyen, and Mikkel Thorup. Heavy hitters via cluster-preserving clustering. CoRR, abs/1604.01357, 2016.

François Le Gall. Faster algorithms for rectangular matrix multiplication. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pp. 514-523. IEEE, 2012.

Yi Li, Honghao Lin, Simin Liu, Ali Vakilian, and David P. Woodruff. Learning the positions in CountSketch. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023a.
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023b.

Colin McDiarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, pp. 195-248. Springer, 1998.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 2020.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics (TACL), 2021.

Zhiqing Sun, Yiming Yang, and Shinjae Yoo. Sparse attention with learning to hash. In International Conference on Learning Representations (ICLR), 2021.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Neural Information Processing Systems (NeurIPS), 2017.

David Woodruff and Amir Zandieh. Leverage score sampling for tensor product matrices in input sparsity time. In International Conference on Machine Learning, pp. 23933-23964. PMLR, 2022.

David P Woodruff and Amir Zandieh. Near input sparsity time kernel embeddings via adaptive sampling. In International Conference on Machine Learning (ICML), 2020.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. Neural Information Processing Systems (NeurIPS), 2019.

Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In International Conference on Computer Vision (ICCV), 2021.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. In Neural Information Processing Systems (NeurIPS), 2020.

Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. KDEformer: Accelerating transformers via kernel density estimation. In International Conference on Machine Learning (ICML), 2023.

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Conference on Artificial Intelligence (AAAI), 2021.

A OMITTED PROOFS

Here we include the proofs that were omitted in the main body of the paper. First, we present the proof of Lemma 1.

Proof of Lemma 1: First, we show that $\tau$ calculated in line 3 of Algorithm 2 is close to the maximum row sum of the matrix $(\mathbf{1}_{n \times n} - M^H) \odot A$. It is easy to check that $\tau \kappa^{-1} \le \langle \mathbf{1} - M^H_{i,:}, \exp(K Q_{i,:}^\top) \rangle \le \tau \kappa$ for all $i \in [n]$, because of the definition of $\kappa$ in the lemma statement. Furthermore, if we define the set

$$S_0 := \left\{ i \in [n] : \langle \mathbf{1} - M^H_{i,:}, \exp(K Q_{i,:}^\top) \rangle > \tau \right\}, \quad (4)$$

then we can show that $|S_0|$ is small. Recall that $\tau$ is the maximum of the row sums of a random subset of rows, denoted by $T$, where $|T| = m$. Hence, $\Pr[|S_0| \ge t] \le (1 - t/n)^m$. Choosing $m = \Omega(\kappa^7 \alpha^2 \varepsilon^{-6} \log n)$ and $t = O(\kappa^{-7} \alpha^{-2} \varepsilon^6 n)$ gives that with probability at least $1 - \frac{1}{\mathrm{poly}(n)}$,

$$|S_0| \le O\left( \kappa^{-7} \alpha^{-2} \varepsilon^6 n \right).$$
(5) Next, let us define the upper-capped version of matrix A where entries of i-th row on positions where the mask M H value is equal to zero are capped at value Ci (line 6 of the algorithm) as: e A Rn n : e Ai,j := min (Ai,j, Ci) if M H i,j = 0 Ai,j otherwise , for every i, j [n]. Published as a conference paper at ICLR 2024 We proceed by bounding the total mass of large entries of matrix D 1A lost through capping (i.e., entries of A that are larger than thresholds Ci). If we define constant b C := ε2m κ2n log n, we can write, D 1(A e A) 1 = X i,j [n] (1 M H i,j) (Ai,j min (Ai,j, Ci)) /Di,i i,j [n] (2t 1) b CDi,i Ci + (2t 1) b CDi,i o t=0 2t+1 b C α (2t b C)2 = O ε4 The inequality in Eq. (6) follows because, for every i [n], the cardinality of the set n i, j [n] : M H i,j = 0 & Ai,j > Ci + (2t 1) b CDi,i o must be bounded by α (2t b C)2 . The proof of this is by contradiction because otherwise, there must be an l [n] such that the cardinality of the set Hl := n i [n] : M H i,l = 0 V Ai,l > Ci + (2t 1) b CDi,i o is at least |Hl| > α n (2t b C)2 . This implies that D 1A e(l) 2 2 P i Hl(Ai,l/Dii)2 = P i Hl Ci Di,i + (2t 1) b C 2 |Hl| ε2m κ2n log n + (2t 1) b C 2 > α n, however, this contradicts with the precondition of the lemma about α = n maxi [n] D 1A e(i) 2 2. Now, if we defined the sets S1, S2 [n] as S1 = i [n] : εDii 3 < Ai,: e Ai,: 1 Dii , S2 = i [n] : Ai,: e Ai,: 1 > Dii then it follows from Eq. (6) that the cardinalities of S1 and S2 are bounded by |S1| O κ 4 α 1 ε3n , |S2| O κ 4 α 1 ε4n . (8) Next note that di computed in line 7 of the algorithm is an estimator for i-th row norm of the capped and masked matrix (1n M H) e A. Let us define an estimator for the unmasked capped matrix e A as bdi := di + M H i,:, Ai,: . By invoking Chernoff-Hoeffding inequality (see e.g., (Mc Diarmid, 1998)) along with union bound, because the lemma statement assumes that m = Ω κ7 α2 the following holds simultaneously for all i [n] \ S2 with probability 1 1 poly(n): e Ai,: 1 1 + ε/6 bdi e Ai,: 1 1 ε/6. (9) This inequality combined with definition of S1, S2 in Eq. (7) implies that for any i [n] \ (S1 S2), (1 ε/2) D 1 i,i bd 1 i (1+ε/2) D 1 i,i . Now we bound the operator norm of the error as follows: ( e D 1 D 1)A op e D 1 D 1 S1 S2 A op + e D 1 D 1 [n]\(S1 S2) A op D 1 S1 A op + e D 1 D 1 S2 A op + ε D 1 [n]\(S1 S2)A op D 1 S1 A op + κ2 D 1 S2 S0A op + κ D 1 S2\S0A op + ε D 1 [n]\(S1 S2)A op D 1 S1 A op + κ2 D 1 S0 A op + κ D 1 S2 A op + ε D 1 [n]\(S1 S2)A op , (10) Published as a conference paper at ICLR 2024 where the second inequality above follows from the inequality in Eq. (9) and also because the definition of S1 ensures that e Di,i (1 1/3) Di,i for any i S1. The third inequality above follows because the lower capping in line 8 of the algorithm ensures that e Dj,j M H j,:, Aj,: + τ/κ for any j S2 while Dj,j M H j,:, Aj,: + τ for any j / S0 by definition of set S0 and we know that Dj,j M H j,:, Aj,: + τκ for j S0. Finally we conclude the proof by bounding the terms D 1 Sr A op for r {0, 1, 2} in Eq. (10). Fix some r {0, 1, 2}. Let v be the unit-normed vector that realizes the operator norm of D 1 Sr A. Since D 1 Sr A is a non-negative matrix, w.l.o.g., we can assume that v is a non-negative vector. More precisely v := arg max x Rn + x 2=1 D 1 Sr A x 2. One has that D 1 Sr A v 2 = D 1 Sr A op. We define the sequence of binary matrices B0, B1, B2 . . . which have same shape as D 1 Sr A as follows: Bt i,j := 1n 2 t 1 α/n<[D 1 Sr A]i,j 2 t α/n o for every integers t 0. 
(11) Note that because of the precondition of the lemma about α = n maxi [n] D 1A e(i) 2 2 which implies [D 1A]i,j p α/n, we have the following inequality on each entry of the matrix D 1 Sr A: Since D 1 Sr A and v both have non-negative entries, the above inequality implies the following: D 1 Sr A v 2 p t=0 2 t Bt v t=0 2 t Bt v 2 . (12) Now to bound Bt v 2 we first find bounds on the number of 1 s in rows and columns of Bt. Using the definition of Bt in Eq. (11) and the fact that row sums in matrix D 1A are equal to 1, we have: Bt i,: 0 min(2t+1p n/α, n). (13) Additionally, using the precondition of the lemma about α = n maxi [n] D 1A e(i) 2 2, we have: Bt :,j 0 min(22t+2, |Sr|). (14) Now we bound the norm Bt v 2 for an arbitrary integer t 0 as follows: Bt i,: 0 Bt i,: v 2 2 (Cauchy Schwarz inequality) Bt i,: v 2 2 = 2t+1p i=1 Bt i,j v2 j Bt :,j 0 v2 j n/α min(22t+2, |Sr|) X j [n] v2 j = 2t+1p n/α min(22t+2, |Sr|), where the inequality in second line above follows from Eq. (13) and the inequality in the last line follows from Eq. (14). The last equality follows from the assumption that v 2 = 1. Therefore, 2 (n/α)1/4 if 2t+1 p 2 (n/α)1/4 p |Sr| otherwise . Published as a conference paper at ICLR 2024 Now by plugging the above inequalities into Eq. (12) we find that: D 1 Sr A op p t=0 2 t Bt v 2 + |Sr| 2 t Bt v 2 6κ7/4α1/4 if r = 0 9κ if r = 1 ε 6κ if r = 2 , where the last line above follows from the upper bound on the size of sets Sr we obtained in Eq. (5) and Eq. (8). Finally, by plugging the above inequality into Eq. (10) and using the fact that D 1A is a row-stochastic matrix and thus D 1A op 1 the lemma follows. Next, we prove the main theorem. Proof of Theorem 1: The diagonal matrix e D in line 2 is computed by invoking Algorithm 2. By Lemma 1 with probability at least 1 1 poly(n), it holds that e D 1 D 1 A op ε/2 D 1A op. Furthermore, in line 3 of the algorithm S is defined as the sampling matrix according to the row norms of V . To invoke Lemma 2, we need to have a bound on the stable rank of e D 1A. First, from Lemma 1 we know that e D 1A op (1 ε/2) D 1A op 1/2, where the second inequality follows because D 1A is a row-stochastic matrix. Therefore we have srank( e D 1A) F . Second, the lower capping in line 8 of Algorithm 2 ensures that e D 1A 2 κ2 D 1A 2 F no(1) D 1A 2 F no(1), thus srank( e D 1A) no(1). With this bound on the stable rank and since m = d no(1) = Ω ε 2d srank( e D 1A) , Lemma 2 implies that e D 1AS SV e D 1AV op ε/3 e D 1A op V op ε/2 D 1A op V op with probability 0.99. By combining these two inequalities, we obtain the spectral approximation guarantee in Eq. (1). The runtime of this algorithm is primarily determined by the time it takes to invoke Algorithm 2 in line 2, which is dominated by the time to multiply two matrices of sizes n d and d m and the time to calculate attention matrix entries at the positions defined by M H. Using the matrix multiplication result from (Le Gall, 2012), the first computation can be done in time O(dn1+o(1)) and the latter can be done in nnz(M H). Proof of Corollary 1: Because r = log2 n, we have Pr[H(Qi, ) = H(Kj, )] 1/n1 o(1) whenever θ(Qi, , Kj, ) π 2 (1 o(1)). As there are at most n2 total pairs, the expected number of such pairs that collide under H is at most n1+o(1) and so by a Markov bound is at most n1+o(1) with failure probability 1/no(1). Since we also assume there are at most n1+o(1) pairs (i, j) with θ(Qi, , Kj, ) < π 2 (1 o(1)), there can be at most n1+o(1) additional pairs that collide. 
Thus, in total we have $n^{1+o(1)}$ collisions, and consequently the number of non-zero entries in $M^H$ is at most $n^{1+o(1)}$ with failure probability $1/n^{o(1)}$. The proof now follows from the assumptions of the corollary statement as well as Theorem 1.

Proof of Corollary 2: Let $T$ be an Expander Sketch matrix with $O(\tau \log n)$ rows. By the guarantees of $T$ (Larsen et al., 2016), we have with probability at least $1 - \frac{1}{n^3}$, for any fixed vector $x \in \mathbb{R}^n$, that from $Tx$ one can recover a set $S$ of indices $i \in [n]$ such that if $x_i^2 \ge \frac{\|x\|_2^2}{\tau}$ then $i \in S$, whereas if $x_i^2 \le \frac{\|x\|_2^2}{2\tau}$ then $i \notin S$. Further, $S$ can be found from $Tx$ in $n^{o(1)}$ time.

We compute $TQ$, followed by $(TQ)K^\top$. Note that the time for this computation is $O(\tau n \log n)$. Next, for each column $j$ of $QK^\top$, this allows us to construct a set $S_j$ with the property that if $(QK^\top)_{i,j}^2 \ge \frac{\|QK^\top e_j\|_2^2}{\tau}$, then $i \in S_j$. This holds simultaneously for all columns with probability at least $1 - \frac{1}{n^2}$ by a union bound. The time for constructing all the sets $S_j$ is $n^{1+o(1)} d$.

Note that $|S_j| \le 2\tau$ for all $j$, and we can explicitly compute the exact value of $(QK^\top)_{i,j}$ for all $i \in S_j$ and all $j$ in $O(n \tau d)$ time. By the assumptions of the corollary, $S_j$ contains a superset of the support of the $j$-th column of $M^H$, and since we can compute the values exactly, we can exactly construct the mask matrix $M^H$ that the corollary requires, in $n^{1+o(1)} d$ time. The proof now follows from the assumptions of the statement as well as Theorem 1.
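As a rough illustration of the sketching idea in this proof, the following is a single-repetition CountSketch stand-in, not the Expander Sketch of Larsen et al. (2016) and without its recovery-time or correctness guarantees: it sketches $Q$, forms $(TQ)K^\top$, and flags entries whose estimated square clears a $1/\tau$ fraction of the estimated column norm. The dense candidate map is materialized only for clarity; a faithful implementation would repeat the sketch, take medians, and recover only the $O(\tau)$ candidates per column.

```python
import numpy as np

def countsketch_heavy_mask(Q, K, tau, s, seed=0):
    """Candidate heavy entries of Q K^T, column by column, from one CountSketch T with s
    rows applied to Q, so Q K^T itself is never formed (toy stand-in for Corollary 2)."""
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    bucket = rng.integers(0, s, size=n)              # hash h(i) in [s]
    sign = rng.choice([-1.0, 1.0], size=n)           # random signs g(i)
    T = np.zeros((s, n))
    T[bucket, np.arange(n)] = sign                   # CountSketch: one nonzero per column
    SK = (T @ Q) @ K.T                               # s x n sketch of Q K^T
    est = sign[:, None] * SK[bucket, :]              # est[i, j] estimates (Q K^T)[i, j]
    col_norm_sq = (SK ** 2).sum(axis=0)              # unbiased estimate of ||Q K^T e_j||^2
    return (est ** 2 >= col_norm_sq[None, :] / tau).astype(np.uint8)
```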