Published as a conference paper at ICLR 2024

LINEAR LOG-NORMAL ATTENTION WITH UNBIASED CONCENTRATION

Yury Nahshan, Joseph Kampeas and Emir Haleva
Distributed and Parallel Software Lab, Huawei Technologies
Email: {first.last}@huawei.com

ABSTRACT

Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism with respect to the sequence length. This limitation poses a substantial obstacle when dealing with long documents or high-resolution images. In this work, we study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability. Furthermore, we propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention, designed to emulate the distribution and concentration behavior of the original self-attention. Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives, offering a promising avenue for enhancing the scalability of transformer models.

1 INTRODUCTION

Transformer models, proposed by (Vaswani et al., 2017), have become widely used deep learning architectures that have achieved state-of-the-art performance in various fields, including Natural Language Processing (NLP) (Brown et al., 2020; Devlin et al., 2018), Computer Vision (CV) (Dosovitskiy et al., 2020), Neural Machine Translation (NMT) (Chen et al., 2018), Document Summarization (Zhang et al., 2019; Pilault et al., 2020), and Protein Structure Prediction (Bahdanau et al., 2015). The main component of the Transformer model is an attention mechanism that identifies complex dependencies between tokens and efficiently captures token correlations. However, standard self-attention suffers from quadratic memory and computation complexity, which arises from the N x N attention matrix, where N is the sequence length. This problem is particularly significant during training, as it requires storing the attention matrix for gradient computation. Consequently, it substantially hinders the training of Transformer models with long sequences.

Recently, there has been increasing interest in training Transformer models with long sequences, especially for large language models (Scao et al., 2022; Zhang et al., 2022; Chowdhery et al., 2022). Various approaches address the quadratic memory issue of self-attention. One class of methods is sparse attention, which aims to perform only a subset of the attention computations while preserving the softmax function (Child et al., 2019; Zaheer et al., 2020). Another approach is Linearized Attention (LA), which replaces the softmax with a product of two functions (Choromanski et al., 2020; Katharopoulos et al., 2020). These methods reduce the computational and memory complexity of the attention mechanism while striving to maintain performance. One of LA's benefits is that it performs dense operations and does not require special hardware or a low-level implementation. However, despite their efficiency, LA methods often underperform compared to standard self-attention. Thus, understanding the reasons behind the superior performance of self-attention is crucial for designing an effective LA method.
In this paper, we propose a systematic way to develop an LA method with performance comparable to Softmax Attention (SA). First, we define a holistic model of the SA and examine its characteristics. Then, we analyze the SA from three different perspectives, focusing on its statistical, informational, and algebraic properties. In particular, we characterize the probability distribution of the attention matrix and prove its log-normal nature. Moreover, we study the concentration behavior of the SA by analyzing its entropy and its spectral gap (Coste, 2017). Based on the proposed model, we develop an LA method that emulates the distribution and concentration behavior of the SA, achieving comparable performance. Finally, we evaluate the effectiveness of our method on popular NLP benchmarks and compare it with other state-of-the-art methods. In summary, our contributions are as follows:

- We conduct an in-depth analysis of self-attention, characterizing its statistical, informational, and algebraic properties.
- We develop tools suitable for studying the concentration ability of the attention, based on the entropy and spectral gap metrics.
- Using the developed model and tools, we design Linear Log-Normal Attention (LLN Attention), which achieves performance comparable to SA while requiring memory and computational complexity linear in the sequence length.

We have made the code of our method available for the MindSpore (gitee.com/ynahshan/linear-lognormal-attention-ms) and PyTorch (github.com/ynahshan/linear-lognormal-attention) frameworks.

2 BACKGROUND AND RELATED WORK

In this section, we present a brief overview of the attention mechanism and various methods for efficient and linearized attention. We review the most relevant works in the field, classifying them into different types of attention methods, such as sparse attention, low-rank projection, memory-based, kernel-based approximation, and clustering-based methods.

2.1 BACKGROUND ON SELF-ATTENTION

In the seminal study of (Bahdanau et al., 2015), the authors proposed the attention mechanism, which was subsequently incorporated into the Transformer model (Vaswani et al., 2017). Since then, attention has become a fundamental building block for many Transformer-based models. Consider a sequence of N tokens, where each token is represented by d-dimensional query, key, and value vectors, denoted as {q_i}_{i=1}^N, {k_i}_{i=1}^N, and {v_i}_{i=1}^N, respectively. The SA is defined as:

\mathrm{Attn}(q_i, \{k_1,\dots,k_N\}, \{v_1,\dots,v_N\}) = \frac{\sum_{j=1}^{N} \kappa_{\exp}(q_i, k_j)\, v_j}{\sum_{l=1}^{N} \kappa_{\exp}(q_i, k_l)}    (1)

where \kappa_{\exp} is the exponential kernel used in the softmax function:

\kappa_{\exp}(q_i, k_j) = e^{\langle q_i, k_j \rangle / \sqrt{d}}    (2)

The recent study by (Wright & Gonzalez, 2021) has examined SA from the perspective of the kernel method. Notably, the formulation of SA in eq. (1) can be seen as Nadaraya-Watson kernel regression (Nadaraya, 1964), estimating some unknown function with joint distribution p(k, v) and density p(k) with a kernel (Han et al., 2022). Moreover, as shown by (Tsai et al., 2019), other popular kernels, such as the polynomial or Radial Basis Function (RBF) kernel, can be used instead of the exponential kernel. However, the performance may vary depending on the type of the kernel.
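For concreteness, here is a minimal sketch of the softmax attention of Eq. (1) for a single head, assuming PyTorch tensors q, k, v of shape (N, d) and the 1/sqrt(d) scaling of the exponential kernel. This is an illustration only, not the authors' released implementation.

```python
import torch

def softmax_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    d = q.shape[-1]
    # Attention scores a_ij = <q_i, k_j> / sqrt(d); this materializes the full
    # N x N matrix, which is the source of the quadratic memory cost.
    scores = q @ k.transpose(-1, -2) / d ** 0.5
    p = torch.softmax(scores, dim=-1)   # row-stochastic attention matrix P
    return p @ v                        # weighted average of the values
```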
A kernel method perspective of the attention allows us to address the problem of attention linearization by using the connection between any kernel and its feature embedding function Φ, described by Mercer's theorem (Mercer, 1909):

\kappa(q_i, k_j) = \langle \Phi(q_i), \Phi(k_j) \rangle    (3)

2.2 LINEARIZED ATTENTION

In recent years, several techniques have been proposed to address the quadratic cost associated with SA. Based on the taxonomy of (Zhu et al., 2021), these techniques can be categorized into five types: i) sparse attention mechanisms with predefined patterns, including sliding-window approaches such as Sparse Transformer (Child et al., 2019), Axial Transformer (Ho et al., 2019), Blockwise Attention (Qiu et al., 2019), Longformer (Beltagy et al., 2020), and BigBird (Zaheer et al., 2020), where some of these works (Wang et al., 2021) manage to improve model convergence due to noise reduction; ii) low-rank projection methods, including Linformer (Wang et al., 2020), Synthesizer (Tay et al., 2020a), Nyströmformer (Xiong et al., 2021), Skyformer (Chen et al., 2021), and Cosformer (Qin et al., 2022b); iii) memory-based methods, such as Set Transformer (Lee et al., 2018b) and Compressive Transformers (Rae et al., 2019); iv) kernel-based approximations of the attention matrix, including Performer (Choromanski et al., 2020), Linear Transformers (Katharopoulos et al., 2020), and RFA (Peng et al., 2021); and v) similarity and clustering methods, including Reformer (Kitaev et al., 2020), Routing Transformer (Roy et al., 2020), Sinkhorn Attention (Tay et al., 2020b), and Clustered Attention (Vyas et al., 2020).

Some of these methods combine multiple types of efficient attention mechanisms. For instance, (Zhu et al., 2021) suggested a combination of low-rank projection and local window attention. Similarly, (Qin et al., 2022a) incorporate both kernel-based and block-wise attention in their approach. Our method also integrates kernel and block-wise techniques while suggesting a novel kernel approach that differs from that of (Qin et al., 2022a). By leveraging the benefits of multiple attention mechanisms, these techniques offer more efficient and accurate models for various NLP tasks.

Kernel-based attention requires selecting a feature embedding function Φ to compute the LA kernel of eq. (3). Linearized attention can then be defined as:

\mathrm{Attn}_{lin}(q_i, \{k_j\}_{j=1}^N, \{v_j\}_{j=1}^N) = \frac{\sum_{j=1}^{N} \Phi_Q(q_i)^\top \Phi_K(k_j)\, v_j}{\sum_{l=1}^{N} \Phi_Q(q_i)^\top \Phi_K(k_l)} = \frac{\Phi_Q(q_i)^\top \sum_{j=1}^{N} \Phi_K(k_j)\, v_j^\top}{\Phi_Q(q_i)^\top \sum_{l=1}^{N} \Phi_K(k_l)}    (4)

The choice of feature embedding function is crucial, as we demonstrate later in Section 4. Different works suggest different types of embedding functions for this purpose. For example, Performer (Choromanski et al., 2020) uses an exponential function, Linear Transformers (Katharopoulos et al., 2020) use the ELU function, and RFA (Peng et al., 2021) uses trigonometric functions to approximate the Gaussian kernel with Fourier features. However, none of these works have analyzed the properties of the attention mechanism induced by these functions.
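A minimal sketch of the linearized form in Eq. (4): replacing the softmax kernel with an inner product of feature embeddings lets the sums over j be computed once and reused, giving O(N) time and memory. The default feature maps here (elu(x)+1, as used by Katharopoulos et al., 2020) are placeholders, not the kernel this paper proposes; the LLN kernel is introduced in Section 4.

```python
import torch

def linear_attention(q, k, v,
                     phi_q=lambda x: torch.nn.functional.elu(x) + 1,
                     phi_k=lambda x: torch.nn.functional.elu(x) + 1):
    qf, kf = phi_q(q), phi_k(k)              # (N, d') feature embeddings
    kv = kf.transpose(-1, -2) @ v            # sum_j Phi_K(k_j) v_j^T -> (d', d)
    z = kf.sum(dim=-2)                       # sum_l Phi_K(k_l)       -> (d',)
    num = qf @ kv                            # Phi_Q(q_i)^T (sum_j Phi_K(k_j) v_j^T)
    den = qf @ z.unsqueeze(-1)               # Phi_Q(q_i)^T (sum_l Phi_K(k_l))
    return num / den
```

Because the N x N attention matrix is never formed, the memory footprint grows linearly in N, matching the complexity analysis of Appendix A.2.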
3 DISSECTING SOFTMAX ATTENTION

In the previous section, we discussed the LA concept. Although this concept may seem straightforward, creating an LA mechanism that effectively handles complex tasks presents a challenging problem. Typically, an LA of the form in eq. (4) performs worse than the SA.

To gain insight into the superiority of the SA, we conduct a thorough analysis of its properties. We start by characterizing the distribution of the attention matrix, since it is the core element of the attention mechanism. Then, we study the connection between its entropy, its spectral gap, and the concentration ability of self-attention.

We begin our analysis by formalizing a model based on the SA from Equation (1). Our model assumes that queries and keys approximately follow a Gaussian distribution. This assumption is reasonable due to the Central Limit Theorem (CLT) (Lee et al., 2018a) and is accepted in the literature for tractability purposes (Ioffe & Szegedy, 2015; Banner et al., 2018). Moreover, let us assume the mean of queries and keys is approximately zero, which is a valid assumption due to the presence of layer normalization in Transformer models.

Model Definition. Let q_i, k_j \in R^d be Gaussian vectors, whose elements satisfy q_{i\ell} \sim N(0, \sigma_q^2) and k_{j\ell} \sim N(0, \sigma_k^2), \forall \ell. Let a_{ij} = q_i^\top k_j / \sqrt{d} be the attention score of the pair (i, j), whose variance can be expressed as \sigma_{sm}^2 = \sigma_q^2 \sigma_k^2 + C_{cross}, where C_{cross} is the cross-covariance of the squared queries and keys (Goodman, 1960). We define the temperature of the SA as:

\tau_{sm} = \frac{1}{\sigma_{sm}} = \frac{1}{\sqrt{\sigma_q^2 \sigma_k^2 + C_{cross}}}    (5)

Denote \bar{a}_{ij} = a_{ij} / \sigma_{sm} and let P^{(SM)} \in R^{N \times N} be the SA matrix, where N is the sequence length, such that:

P^{(SM)}_{ij} = \frac{e^{\bar{a}_{ij}/\tau_{sm}}}{\sum_{l=1}^{N} e^{\bar{a}_{il}/\tau_{sm}}}    (6)

The form in Equation (6) is significant, as it demonstrates the connection between SA and an implicit temperature parameter imposed by the variance of the attention inputs. We can draw an analogy between SA training and stochastic processes, where controlling the temperature allows balancing between exploration and exploitation. A high temperature results in equal probabilities for all tokens (exploration), whereas a low temperature results in a high probability for one or a few tokens, emphasizing them (exploitation of these particular tokens).

3.1 CHARACTERIZING DISTRIBUTION OF SOFTMAX ATTENTION

Let us now characterize the probability distribution of the SA. By analyzing its probability distribution, we can gain valuable insights into the behavior of the SA and reveal its statistical properties. In particular, the distribution of P^{(SM)} plays a crucial role in quantifying the variability of its entries. This variability is closely related to the concentration ability of the SA, a topic that we will explore in more detail in subsequent sections.

Proposition 3.1. Let q and k be Gaussian vectors, where q_i \sim N(0, \sigma_q^2) and k_j \sim N(0, \sigma_k^2), \forall i, j. Then, for moderate values of \sigma_q^2, \sigma_k^2 and large enough N, the distribution of P^{(SM)} can be approximated by a log-normal distribution with parameters \mu_{sm} = -\ln N - \frac{1}{2}\sigma_{sm}^2 and \sigma_{sm}^2 = \sigma_q^2 \sigma_k^2 + C_{cross}.

The key behind the proof is to approximate the denominator in eq. (6) with a log-normal distribution via the (Fenton, 1960) theorem. Then, since the numerator is also log-normal by the CLT, the resulting ratio can be approximated by a log-normal distribution. This leads to the log-normal distribution of P^{(SM)}. The detailed proof is given in Appendix A.5.

The log-normal probability distribution of the SA matrix helps us understand the attention mechanism. The skewness of the log-normal distribution emphasizes certain positions and enables concentration. The temperature parameter controls uncertainty, influencing the balance between exploration and exploitation during training.
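A small simulation sketch of the model above and Proposition 3.1: for zero-mean Gaussian queries and keys, the log of the softmax attention entries should have variance close to sigma_q^2 * sigma_k^2 (the cross-covariance term is ~0 for independent inputs) and mean close to -ln N - sigma_sm^2 / 2. The constants below are arbitrary illustrative choices.

```python
import torch

torch.manual_seed(0)
N, d, sigma_q, sigma_k = 1024, 64, 1.2, 0.9
q = sigma_q * torch.randn(N, d)
k = sigma_k * torch.randn(N, d)
scores = q @ k.T / d ** 0.5                  # a_ij, approximately N(0, sigma_q^2 * sigma_k^2)
p = torch.softmax(scores, dim=-1)            # P^(SM)
log_p = p.log()
sigma_sm2 = (sigma_q * sigma_k) ** 2
print("log-variance:", log_p.var().item(), " predicted:", sigma_sm2)
print("log-mean:", log_p.mean().item(),
      " predicted:", -torch.log(torch.tensor(float(N))).item() - 0.5 * sigma_sm2)
```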
3.2 ANALYZING SELF-ATTENTION THROUGH A MARKOV CHAIN PERSPECTIVE

To delve deeper into the self-attention mechanism, we draw inspiration from the principles of Markov chains (Levin et al., 2006). Each row of the self-attention matrix represents the correlation between a specific token and all other tokens. These correlations closely resemble transition probabilities in classical Markov chains. The self-attention matrix continuously evolves during training, eventually converging to a final model.

3.2.1 ENTROPY AND ATTENTION CONCENTRATION

A crucial parameter in understanding such a stochastic system is its entropy, a metric commonly used to measure the uncertainty or randomness associated with the state transitions of a Markov chain. In the context of self-attention, entropy serves as a valuable tool to evaluate the concentration ability of the self-attention. We refer to this as Attention Concentration (AC), which essentially measures the model's ability to direct its focus toward specific tokens, thereby extracting relevant information from the input sequence. Previous studies (Ghader & Monz, 2017; Vig & Belinkov, 2019) have proposed using entropy to measure the AC. Lower entropy indicates a greater focus on a few tokens, while higher entropy indicates more uniformly distributed attention.

[Figure 1: Temperature (left), entropy (center), and spectral gap (right) during training of the small RoBERTa model with a single head per layer, plotted against the training step (x-axis) for layers 0-3.]

To formally define the entropy of the attention matrix P^{(SM)}, we average the entropy of the individual rows as follows:

H(P^{(SM)}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} P^{(SM)}_{ij} \log_2\!\big(P^{(SM)}_{ij}\big)    (7)

Note that by eq. (6), the attention matrix can be represented in terms of its temperature. To further explore the connection between AC and temperature, we present the following theorem, which characterizes the relationship between the entropy of the SA and its temperature:

Theorem 3.2. The entropy in eq. (7) is monotonically increasing with the temperature \tau_{sm}.

To prove the theorem, we consider the derivative of the entropy with respect to the temperature and show that it is always positive. For the detailed proof, refer to Appendix A.4.1. According to Theorem 3.2, the entropy of the SA increases with the temperature, which controls the concentration of the SA. Essentially, a higher temperature results in a more dispersed distribution of attention. Conversely, a lower temperature makes it easier to focus on specific tokens. Moreover, the temperature controls the exploration (higher entropy) and exploitation (lower entropy) of the states within the chain. Figure 1a shows how the temperature decreases during training, resulting in a more confident state (lower entropy), as seen in Figure 1b. Notably, while the first layers of the model retain high entropy, allowing exploration, the entropy of the middle layers decreases and becomes approximately zero, leading to exploitation in those layers.
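A minimal sketch of the entropy metric of Eq. (7), assuming a row-stochastic attention matrix `p` of shape (N, N) such as the output of torch.softmax over the last dimension; the small epsilon is an added numerical safeguard.

```python
import torch

def attention_entropy(p: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # Row-wise Shannon entropy in bits, averaged over the N rows (Eq. 7).
    row_entropy = -(p * (p + eps).log2()).sum(dim=-1)
    return row_entropy.mean()
```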
3.2.2 SPECTRAL GAP AND ATTENTION CONCENTRATION

In the analysis of Markov chains, the spectral gap is a valuable metric to consider because it provides insights into the speed at which the chain reaches its stationary state (Coste, 2017). In other words, it quantifies the rate of convergence, where larger values of the spectral gap indicate a faster convergence process, while smaller values suggest a slower one.

When applied to attention mechanisms, the spectral gap can provide insights into the rate at which the attention mechanism focuses on specific elements within the input sequence. Including the spectral gap in our analysis gives us a more comprehensive understanding of the SA mechanism from an algebraic perspective. The spectral gap measures the difference between the first and the second largest eigenvalues. Since the attention matrix is a stochastic matrix (Meyer, 2000), it follows from the Perron-Frobenius theorem (Samelson, 1957) that its largest eigenvalue is \lambda_1 = 1. Therefore, the spectral gap is \gamma = 1 - |\lambda_2|, where \lambda_2 is the second largest eigenvalue. In Theorem 3.3, we establish a relationship between the variance of the attention matrix and the spectral gap.

Theorem 3.3. Let P \in R^{N \times N} be a right stochastic matrix with eigenvalues \lambda_1, \dots, \lambda_N ordered by their absolute values, where \lambda_1 = 1 \geq |\lambda_2| \geq \dots \geq |\lambda_N|. Let \mathbf{v}_{max} be the major principal component of the centered version of P. Then, \lambda_2^2 = \sigma^2_{\mathbf{v}_{max}}, where \sigma^2_{\mathbf{v}_{max}} represents the amount of variance in the direction specified by the major principal component \mathbf{v}_{max}.

To prove this theorem, we deflate the eigenvalue \lambda_1 of the attention matrix and express the variance in the direction of the major principal component. The detailed proof is provided in Appendix A.4.2.

Theorem 3.4. The variance of the attention matrix P^{(SM)} is monotonically decreasing with the temperature \tau_{sm}.

To prove this theorem, we consider the derivative of the variance with respect to the temperature and show that it is always negative. For the detailed proof, refer to Theorem A.3.

According to Theorem 3.3, the magnitude of variability in the direction of the major principal component \mathbf{v}_{max} is equal to \lambda_2; consequently, the spectral gap increases as the variability decreases. Together with Theorem 3.4, we can conclude that the spectral gap increases with the temperature, similarly to the entropy. However, biasing the stochastic matrix towards a particular column also affects the variability. When P is biased toward a specific column, the variability within the columns decreases, resulting in a smaller value of \lambda_2 and a higher spectral gap, regardless of the temperature. Therefore, we can conclude that the spectral gap increases with the temperature only when the attention matrix is unbiased. This phenomenon led us to refer to the spectral gap as a measure of Unbiased Attention Concentration. Figure 1c depicts the change in the spectral gap during training. In most layers, the spectral gap decreases during training, indicating improved AC. However, in some layers, the spectral gap increases while the temperature remains constant, suggesting that the attention matrix is biased. This observation shows that the spectral gap carries information additional to the entropy.
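A minimal sketch of the spectral-gap metric: since a row-stochastic attention matrix has largest eigenvalue 1, the gap is 1 - |lambda_2|. A dense eigendecomposition is used here, which is adequate for diagnostic use on a single attention matrix but not intended for training-time efficiency.

```python
import torch

def spectral_gap(p: torch.Tensor) -> float:
    eigvals = torch.linalg.eigvals(p)                   # complex eigenvalues
    mags = eigvals.abs().sort(descending=True).values   # |lambda_1| >= |lambda_2| >= ...
    return (1.0 - mags[1]).item()                       # gamma = 1 - |lambda_2|
```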
4 DESIGN OF LINEARIZED ATTENTION

In the previous section, we presented a holistic model of SA and conducted a thorough analysis of its properties. Specifically, we identified the log-normal distribution of the SA matrix. Additionally, we analyzed the concentration behavior dictated by the temperature parameter. We can measure AC using the entropy (biased) and the spectral gap (unbiased) metrics. In this section, we design an LA method based on the defined model, which exhibits similar characteristics and imitates SA behavior. In particular, our LA model should have a log-normal distribution with similar moments. Moreover, it should emulate the concentration pattern of the SA by matching its entropy and spectral gap curves. As a result, we expect our LA method to achieve performance comparable to the SA.

4.1 LINEAR LOG-NORMAL ATTENTION

Designing LA according to Equation (4) requires selecting a feature embedding function Φ, the core element of this attention. The choice of this function has a crucial effect on LA performance. According to our model, we start by satisfying the log-normality requirement, as most functions do not have this property. For example, the Rectified Linear Unit (ReLU) cannot produce a log-normal distribution, being almost linear. On the other hand, the exponential function induces a log-normal distribution for Gaussian inputs, which justifies its selection as the feature embedding function Φ. However, to match the concentration behavior of the SA, we must force the LA to produce entropy and spectral gap curves with respect to the temperature similar to those of the SA. To achieve this goal, we introduce additional parameters, which we tune to perform moment matching between the LA distribution and that of the SA.

Accordingly, let us denote by \Phi_Q(q) = e^{\alpha q} and \Phi_K(k) = e^{\beta k} the feature embedding functions, where \alpha, \beta \in R^{+} are hyper-parameters that must be carefully selected to ensure our LA closely approximates the SA. We define the Linear Log-Normal (LLN) Attention as:

\mathrm{Attn}_{LLN}(q_i, \{k_1,\dots,k_N\}, \{v_1,\dots,v_N\}) = \frac{\sum_{j=1}^{N} (e^{\alpha q_i})^\top e^{\beta k_j}\, v_j}{\sum_{l=1}^{N} (e^{\alpha q_i})^\top e^{\beta k_l}}    (8)

where each entry of the LLN attention matrix can be expressed as:

P^{(LLN)}_{ij} = \frac{(e^{\alpha q_i})^\top e^{\beta k_j}}{\sum_{l=1}^{N} (e^{\alpha q_i})^\top e^{\beta k_l}}    (9)

Similarly to the analysis of the SA model, we assume zero mean of queries and keys. Then, to show that the LLN Attention matrix follows a log-normal distribution, we prove the following:

Proposition 4.1. Let q and k be Gaussian vectors, where q_i \sim N(0, \sigma_q^2) and k_j \sim N(0, \sigma_k^2), \forall i, j. Then, for moderate values of \sigma_q^2 and \sigma_k^2, the distribution of P^{(LLN)} can be approximated by a log-normal distribution with variance \sigma_{lln}^2 = a\,(\alpha^2\sigma_q^2 + \beta^2\sigma_k^2) + b, where a and b are constants.

The main steps of the proof are approximating the numerator and denominator in eq. (9) using the log-normal distribution, following the theorem by (Fenton, 1960), and then splitting the analysis into three cases to express the variance, as suggested by (Romeo et al., 2003). The detailed proof is given in Appendix A.6.

[Figure 2: Comparison of entropy (left) and spectral gap (right) as functions of the input scale for various attention kernels (Softmax, Quadratic, ReLU, their linear counterparts, LLN with moment matching, and LLN with α = β = 1). The entropy and spectral gap of the LLN Attention with moment matching are similar to those of the SA.]
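A minimal sketch of LLN Attention as defined in Eq. (8): the feature maps are elementwise exponentials exp(alpha*q) and exp(beta*k) plugged into the linearized form of Eq. (4). Here alpha and beta are passed in as plain floats (they come from the moment matching described next); numerical safeguards such as subtracting a per-feature maximum before exponentiating are omitted for clarity.

```python
import torch

def lln_attention(q, k, v, alpha: float, beta: float):
    qf = torch.exp(alpha * q)             # Phi_Q(q) = exp(alpha * q), shape (N, d)
    kf = torch.exp(beta * k)              # Phi_K(k) = exp(beta * k),  shape (N, d)
    kv = kf.transpose(-1, -2) @ v         # sum_j Phi_K(k_j) v_j^T
    z = kf.sum(dim=-2, keepdim=True)      # sum_l Phi_K(k_l)
    return (qf @ kv) / (qf @ z.transpose(-1, -2))
```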
This linear connection facilitates the calculation of constants a and b. It allows the application of linear interpolation on the randomly generated Gaussian samples q and k to perform the moment matching between LLN and SA. We provide a detailed description of the technical aspects of this moment-matching technique in Appendix A.7. Finally, by requiring σlln = σsm and expressing it in terms of α and β, we can determine a and b parameters. We point out that there is no closed analytical solution for which both the mean and variance of LLN and SA align. Yet, since the concentration is mostly affected by the variance of the attention matrix, we only match the variances of the LLN and SA. Further, to simplify the solution, we also let α2σ2 q = β2σ2 k = 1 2 σ2. Hence, we obtain the following: 2σq ; β = σ 1 a(σ2qσ2 k b) (10) A detailed derivation of Equation (10) is given in Appendix A.7. Note that, like in the SA, we can introduce a temperature parameter of the LLN Attention that controls the concentration. Specifically, let us define the temperature of the LLN Attention to be: a (α2σ2q + β2σ2 k) + b (11) In Figure 2, we demonstrate that the moment matching is essential to align the entropy and the spectral gap of the LLN Attention with those of the SA to achieve the required concentration. Moreover, other popular kernels, such as quadratic, Re LU, and their linear counterparts, are indifferent to the temperature, which may result in poor concentration and potentially degraded performance. In conclusion, LLN Attention satisfies the desired log-normal distribution property and concentration behavior required by the SA model. Therefore, it should achieve comparable results to the SA. Published as a conference paper at ICLR 2024 4.2 THE OVERALL ARCHITECTURE Figure 3: LLN Transformer layer architecture. In this section, we present the experimental results of the LLN Attention method on NLP tasks, while more experiments on Image Classification and LRA(Tay et al., 2020c) benchmark available in the Appendix A.8. The LLN Attention effectively scales to long sequences while maintaining high concentration, allowing capturing long-range interactions. However, for short-range connections, it may be less effective. Recently, a study by (Qin et al., 2022a) emphasized the attention dilution issue of LA methods. Specifically, LA may overlook neighboring structures, leading to the dilution of short-range interactions. To address this issue, the authors proposed a hybrid approach that combines LA with blockdiagonal attention, which retains the O(N) memory and computational complexity of LA. This block-diagonal attention is a regular SA applied on smaller pieces of the input, computing only the diagonal of the original attention matrix. Such block-diagonal attention can not scale to longer sequences, but it is useful to improve the performance of the LA method. We incorporate this technique into LLN Attention, combining the LLN and block-diagonal attention into a unified layer by averaging the outputs of both components Figure 3. While the block-diagonal mechanism effectively captures short-range interactions within its confined block scope, LLN excels in catching broader, long-range connections. This combined approach enhances the performance of LLN Attention and stabilizes training by reducing the magnitude of the gradients (Qin et al., 2022a). 5 EXPERIMENTS We first pre-train the bidirectional Ro BERTa encoder model (Liu et al., 2019) using LLN Attention on the Wiki Text-103 corpus (Merity et al., 2018). 
5 EXPERIMENTS

In this section, we present the experimental results of the LLN Attention method on NLP tasks; further experiments on image classification and the LRA benchmark (Tay et al., 2020c) are available in Appendix A.8.

We first pre-train the bidirectional RoBERTa encoder model (Liu et al., 2019) using LLN Attention on the WikiText-103 corpus (Merity et al., 2018). During pre-training, we monitor the convergence of the model and compare its performance to the SA model. In Appendix A.8.1, we show that the loss of the LLN Attention closely follows the loss of the SA, indicating similar convergence behavior. Next, to evaluate the performance of LLN Attention on downstream tasks, we fine-tune our pre-trained model on several NLP tasks from the General Language Understanding Evaluation (GLUE) dataset (Wang et al., 2018). These tasks include Multi-Genre Natural Language Inference (MNLI), Question-answering Natural Language Inference (QNLI), Quora Question Pairs (QQP), and Stanford Sentiment Treebank (SST-2). For all our experiments, we use the Fairseq framework (Ott et al., 2019) with the default configuration and hyperparameters of the RoBERTa-base model (https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/README.md).

Table 1 provides a detailed comparison of the accuracy achieved by each method on each task. The LLN Attention method outperforms the other LA methods with an average accuracy of 86.9%. These results confirm the capability of LLN Attention to achieve performance competitive with SA on a range of NLP tasks.

Table 1: Accuracy achieved by various LA methods on multiple NLP tasks from the GLUE dataset, including MNLI, QNLI, QQP, and SST-2. The results of Nyströmformer are given in (Xiong et al., 2021), while the results of the rest are given in (Qin et al., 2022a). For methods marked with *, which have a different baseline in the original paper, we also provide the accuracy drop in parentheses.

| Method | MNLI | QNLI | QQP | SST-2 | Avg |
| SA baseline (Bahdanau et al., 2015) | 80.3 | 87.2 | 89.9 | 90.6 | 87.0 |
| Reformer (Kitaev et al., 2020) | 35.4 | - | 63.2 | 50.9 | 49.8 |
| Performer (Choromanski et al., 2020) | 58.8 | 63.4 | 79.1 | 81.4 | 70.6 |
| ELU (Katharopoulos et al., 2020) | 74.8 | 82.5 | 86.9 | 87.2 | 82.8 |
| Longformer (Beltagy et al., 2020) | 77.2 | - | 85.5 | 88.6 | 83.7 |
| Transformer LS (Zhu et al., 2021) | 77.0 | 84.8 | 86.8 | 90.2 | 84.7 |
| TNN (Qin et al., 2023) | 76.72 | 85.06 | 88.3 | 90.6 | 85.17 |
| T2 (Qin et al., 2022a) | 77.28 | 85.39 | 88.56 | 90.71 | 85.48 |
| Cosformer (Qin et al., 2022b) | 76.7 | - | 89.2 | 91 | 85.6 |
| T1 (Qin et al., 2022a) | 79.06 | 87.0 | 88.61 | 91.17 | 86.46 |
| Flash (Hua et al., 2022) | 79.45 | 87.1 | 88.83 | 90.71 | 86.52 |
| Nyströmformer* (Xiong et al., 2021) | 80.9 (-1.5) | 88.7 (-1.6) | 86.3 (-1.0) | 91.4 (+1.4) | 86.8 (-0.7) |
| LLN Attention (Ours) | 77.0 | 85.1 | 88.9 | 90.6 | 85.4 |
| LLN+Diag Attention (Ours) | 80.0 | 86.5 | 89.7 | 91.6 | 86.9 |

5.1 SPEED AND MEMORY CONSUMPTION

In this section, we evaluate the training time and memory usage of the LLN Attention, comparing it to the SA and to Nyströmformer, which outperforms most of the available LA methods. In the comparison, we used the RoBERTa-base model with a batch size of one and performed all measurements on a commodity GPU. The results in Table 2 confirm that LLN Attention scales linearly with sequence length, as expected, and can handle sequences at least four times longer than SA. Moreover, the LLN Attention method requires nearly the same amount of memory as Nyströmformer, with Diag Attention adding only about a 10% memory overhead to the LLN Attention. Notably, both LLN and LLN+Diag Attention demonstrate superior speed compared to Nyströmformer.

Table 2: Memory usage and training time per iteration of SA, Nyströmformer, LLN, and LLN+Diag on the RoBERTa model, with varying sequence lengths. OOM indicates an Out Of Memory error.

| Method \ Sequence length | 512 | 1024 | 2048 | 4096 | 8192 | 16384 |
| Memory [GB] Softmax Attention | 4.0 | 5.5 | 12.6 | 32.1 | OOM | OOM |
| Memory [GB] Nyströmformer | 4.0 | 4.5 | 5.5 | 7.3 | 11.6 | 19.1 |
| Memory [GB] LLN Attention | 4.1 | 4.4 | 5.7 | 7.5 | 12.0 | 20.1 |
| Memory [GB] LLN+Diag Attention | 4.1 | 4.6 | 6.1 | 8.1 | 13.4 | 23.0 |
| Time [sec/it] Softmax Attention | 0.95 | 1.05 | 2.4 | 6.8 | OOM | OOM |
| Time [sec/it] Nyströmformer | 1.8 | 1.9 | 2.6 | 4.7 | 8.8 | 16.7 |
| Time [sec/it] LLN Attention | 1.0 | 1.05 | 1.6 | 3.2 | 6.1 | 11.8 |
| Time [sec/it] LLN+Diag Attention | 1.2 | 1.3 | 1.9 | 3.6 | 6.9 | 13.3 |
6 CONCLUSION

In this paper, we introduced a novel LLN Attention method that incorporates essential properties of the SA, such as the log-normal distribution of the attention matrix and its concentration behavior, while offering linear time and memory complexity. Our approach includes a moment-matching technique that matches the attention matrix's log-normal distribution with that of the SA, resulting in improved attention concentration and model performance. In addition, we conducted a comprehensive analysis of the SA, characterizing its distribution and suggesting the entropy and spectral gap metrics for attention concentration analysis. To the best of our knowledge, this is the first work to study self-attention from this perspective. Finally, our experimental results demonstrated that LLN Attention outperforms many existing LA methods on several NLP tasks, demonstrating its competitiveness and potential to enhance attention performance on long sequences. Overall, our contribution provides a foundation for future research and improvements in attention mechanisms.

ACKNOWLEDGEMENTS

We extend our gratitude to Dr. Eliezer Levy, Dror Mizrachi, Dr. Su Teng, Wang Sheng Nan, and Dror Meirovich for their valuable support and fruitful discussions.

REFERENCES

Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Ron Banner, Yury Nahshan, Elad Hoffer, and Daniel Soudry. ACIQ: analytical clipping for integer quantization of neural networks. CoRR, abs/1810.05723, 2018. URL http://arxiv.org/abs/1810.05723.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020. URL https://arxiv.org/abs/2004.05150.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George F. Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation, 2018. URL http://arxiv.org/abs/1804.09849.

Yifan Chen, Qi Zeng, Heng Ji, and Yun Yang. Skyformer: Remodel self-attention with Gaussian kernel and Nyström method, 2021. URL https://arxiv.org/abs/2111.00035.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019. URL http://arxiv.org/abs/1904.10509.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy J. Colwell, and Adrian Weller. Rethinking attention with performers, 2020.
URL https:// arxiv.org/abs/2009.14794. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. ar Xiv preprint ar Xiv:2204.02311, 2022. Simon Coste. The spectral gap of sparse random digraphs, 2017. URL https://arxiv.org/ abs/1708.00530. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding, 2018. URL http://arxiv.org/ abs/1810.04805. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2020. URL https://arxiv.org/abs/2010.11929. Leslie H. Fenton. The sum of log-normal probability distributions in scatter transmission systems, 1960. Hamidreza Ghader and Christof Monz. What does attention in neural machine translation pay attention to?, 2017. URL http://arxiv.org/abs/1710.03348. Leo A. Goodman. On the exact variance of products, 1960. URL https://www. tandfonline.com/doi/abs/10.1080/01621459.1960.10483369. Xing Han, Tongzheng Ren, Tan Minh Nguyen, Khai Nguyen, Joydeep Ghosh, and Nhat Ho. Robustify transformers with robust kernel density estimation, 2022. URL https://arxiv.org/ abs/2210.05794. Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers, 2019. URL http://arxiv.org/abs/1912.12180. Published as a conference paper at ICLR 2024 Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc V. Le. Transformer quality in linear time, 2022. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Co RR, abs/1502.03167, 2015. URL http://arxiv.org/ abs/1502.03167. Ian Jolliffe. Principal component analysis, 2011. URL https://doi.org/10.1007/ 978-3-642-04898-2_455. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Franc ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention, 2020. URL https://arxiv. org/abs/2006.16236. Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer, 2020. URL https://arxiv.org/abs/2001.04451. Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes, 2018a. Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer, 2018b. URL http://arxiv.org/abs/1810.00825. David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2006. URL http://scholar.google.com/scholar.bib?q= info:3wf9IU94ty MJ:scholar.google.com/&output=citation&hl=en&as_ sdt=2000&ct=citation&cd=0. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach, 2019. URL http://arxiv.org/abs/1907.11692. J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations, 1909. Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple scales, 2018. URL https://arxiv.org/abs/1803.08240. C.D. Meyer. 
Matrix Analysis and Applied Linear Algebra. Society for Industrial Mathematics, 2000. URL http://www.matrixanalysis.com/Download Chapters.html. Elizbar A Nadaraya. On estimating regression, 1964. Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. Co RR, abs/1904.01038, 2019. URL http://arxiv.org/abs/1904.01038. Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. Random feature attention, 2021. URL https://arxiv.org/abs/2103.02143. Jonathan Pilault, Raymond Li, Sandeep Subramanian, and Chris Pal. On extractive and abstractive neural document summarization with transformer language models, November 2020. URL https://aclanthology.org/2020.emnlp-main.748. Zhen Qin, Xiao Dong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The devil in linear transformer, 2022a. URL https://arxiv.org/abs/2210. 10340. Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosformer: Rethinking softmax in attention, 2022b. URL https: //arxiv.org/abs/2202.08791. Zhen Qin, Xiaodong Han, Weixuan Sun, Bowen He, Dong Li, Dongxu Li, Yuchao Dai, Lingpeng Kong, and Yiran Zhong. Toeplitz neural network for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview. net/forum?id=Ixm Wsm4xrua. Published as a conference paper at ICLR 2024 Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise selfattention for long document understanding, 2019. URL http://arxiv.org/abs/1911. 02972. Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling, 2019. URL http://arxiv.org/abs/ 1911.05507. M. Romeo, V. Da Costa, and F. Bardou. Broad distribution effects in sums of lognormal random variables, apr 2003. URL https://doi.org/10.1140%2Fepjb%2Fe2003-00131-6. Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers, 2020. URL https://arxiv.org/abs/2003.05997. Hans Samelson. On the perron-frobenius theorem., 1957. Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili c, Daniel Hesslow, Roman Castagn e, Alexandra Sasha Luccioni, Franc ois Yvon, Matthias Gall e, et al. Bloom: A 176bparameter open-access multilingual language model. ar Xiv preprint ar Xiv:2211.05100, 2022. Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention in transformer models, 2020a. URL https://arxiv.org/abs/ 2005.00743. Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention, 2020b. URL https://arxiv.org/abs/2002.11296. Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers, 2020c. URL https://arxiv.org/abs/2011.04006. Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv e J egou. Training data-efficient image transformers & distillation through attention. Co RR, abs/2012.12877, 2020. URL https://arxiv.org/abs/2012.12877. Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. 
Transformer dissection: An unified understanding for transformer s attention via the lens of kernel, 2019. URL http://arxiv.org/abs/1908.11775. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017. URL http://arxiv. org/abs/1706.03762. Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model, 2019. URL http://arxiv.org/abs/1906.04284. Apoorv Vyas, Angelos Katharopoulos, and Franc ois Fleuret. Fast transformers with clustered attention, 2020. URL https://arxiv.org/abs/2007.04825. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2018. URL http://arxiv.org/abs/1804.07461. Pichao Wang, Xue Wang, Fan Wang, Ming Lin, Shuning Chang, Wen Xie, Hao Li, and Rong Jin. KVT: k-nn attention for boosting vision transformers. Co RR, abs/2106.00515, 2021. URL https://arxiv.org/abs/2106.00515. Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity, 2020. URL https://arxiv.org/abs/2006.04768. Eric W Weisstein. Normal product distribution, 2003. Matthew A. Wright and Joseph E. Gonzalez. Transformers are deep infinite-dimensional non-mercer binary kernel machines, 2021. URL https://arxiv.org/abs/2106.01506. Published as a conference paper at ICLR 2024 Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nystr omformer: A nystr om-based algorithm for approximating self-attention, 2021. URL https://arxiv.org/abs/2102.03902. Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Onta n on, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences, 2020. URL https://arxiv.org/abs/2007.14062. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. ar Xiv preprint ar Xiv:2205.01068, 2022. Xingxing Zhang, Furu Wei, and Ming Zhou. HIBERT: document level pre-training of hierarchical bidirectional transformers for document summarization, 2019. URL http://arxiv.org/ abs/1905.06566. Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, and Bryan Catanzaro. Long-short transformer: Efficient transformers for language and vision, 2021. URL https://arxiv.org/abs/2107.02192. Published as a conference paper at ICLR 2024 (a) Softmax attention (b) Linear attention Figure 4: A block diagram of the computational complexity for the Softmax Attention and Linearized Attention. A.1 FORMULATION OF THE SELF-ATTENTION (SA) One of the most widely used variants of the self-attention mechanism is the scaled dot-product attention (Vaswani et al., 2017). In this formulation, the input vectors [x1, . . . ,x N] := X RN d are projected into query, key, and value vectors as follows: [q1, . . . ,q N] := Q = XW q RN d [k1, . . . ,k N] := K = XW k RN d [v1, . . . ,v N] := V = XW v RN d (12) Here, W q,W k,W v Rd d are learnable parameter matrices. The attention function is then computed on the query, key, and value vectors using the following equation: Attn(qi, {k1, . . . ,k N}, {v1, . . . ,v N}) = j=1 softmax q i kj The dot-product term is scaled by the factor 1 d to ensure the stability of computations. 
This scaling factor accounts for the variance of the dot product q_i^T k_j, which grows with the dimensionality d. In this paper, we refer to this scaled dot-product attention as softmax attention or standard attention, which is defined in eq. (1).

A.2 LINEARIZED ATTENTION (LA)

Let Q, K, V \in R^{N \times d} denote the queries, keys, and values of the attention mechanism. Computing softmax attention with Equation (13) requires calculating the quadratic matrix QK^\top = [q_i^\top k_j] \in R^{N \times N}, which has a complexity of O(N^2) with respect to the sequence length N, as illustrated in Figure 4a. Alternatively, if we can decompose softmax(QK^\top) into Q'K'^\top, we can use the associativity property of matrix multiplication to compute Q'(K'^\top V) from right to left. Such a decomposed computation has linear complexity in the sequence length N, as illustrated in Figure 4b. To obtain this decomposition, the un-normalized matrix e^{QK^\top} is replaced with the matrix \Phi(Q)\Phi(K)^\top, where \Phi is a feature map applied row-wise, i.e., Q' = \Phi(Q) and K' = \Phi(K). This form of linear attention can be expressed as follows:

\mathrm{Attn}_{lin}(q_i, \{k_1,\dots,k_N\}, \{v_1,\dots,v_N\}) = \frac{\sum_{j=1}^{N} \Phi(q_i)^\top \Phi(k_j)\, v_j}{\sum_{l=1}^{N} \Phi(q_i)^\top \Phi(k_l)} = \frac{\Phi(q_i)^\top \sum_{j=1}^{N} \Phi(k_j)\, v_j^\top}{\Phi(q_i)^\top \sum_{l=1}^{N} \Phi(k_l)}    (14)

Due to aggregation over N elements in both the numerator and the denominator, the memory complexity of this linearized attention mechanism is linear with respect to the sequence length N.

A.3 KERNEL VIEW OF THE SELF-ATTENTION

The SA mechanism can be viewed as Nadaraya-Watson kernel regression (Nadaraya, 1964), as shown by Wright & Gonzalez (2021); specifically, as the kernel density estimation (KDE) of some unknown function with joint distribution p(k, v) and density p(k) with a kernel \kappa (Han et al., 2022). Therefore, the kernel function can be used to express the self-attention mechanism as follows:

\mathrm{Attn}(q_i, \{k_1,\dots,k_N\}, \{v_1,\dots,v_N\}) = \frac{\sum_{j=1}^{N} \kappa(q_i, k_j)\, v_j}{\sum_{l=1}^{N} \kappa(q_i, k_l)}    (15)

The KDE form of Equation (15) is a generalization of the softmax attention in Equation (13), where we use the softmax kernel \kappa_{sm}(q_i, k_j) = e^{\langle q_i, k_j \rangle / \sqrt{d}}. Interestingly, the kernel view of self-attention reveals that the functionality of the attention mechanism remains intact even when the kernel is modified. However, the performance of the attention mechanism can vary depending on the choice of the kernel function, as demonstrated by Tsai et al. (2019). This emphasizes the significance of considering different kernels and their impact on the performance of linearized attention. Additionally, the kernel view provides a valuable perspective for tackling the linearization problem within the framework of the kernel method. According to Mercer's theorem (Mercer, 1909), any positive-definite kernel (Mercer kernel) can be represented as an inner product of symmetric features. Let X \in R^{N \times d}, x_1, \dots, x_N \in X with kernel function \kappa : X \times X \to R; then

\kappa(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle_{F_H}    (17)

where \Phi : X \to F_H is a function mapping the inputs to a Hilbert space of feature functions F_H. However, the dimensionality of F_H can be large or infinite, making explicit computation of the features infeasible. This is where the kernel trick comes in: the kernel function can be computed without explicitly computing the features. However, storing the attention matrix still requires O(N^2) memory complexity. To address this issue, we can design a kernel function such that the dimensionality of F_H is much smaller than N, which allows for O(N) memory complexity.
For example, consider ΦQ : Q Rd and ΦK : K Rd for some d N, the attention can be computed using the following linear kernel function: κ(qi,kj) = ΦQ(qi), ΦK(kj) (18) Consequently, using the associativity property described in section A.2 can be used to compute this kernel function with O(N) memory complexity. A.4 ANALYSIS OF THE ENTROPY, VARIANCE AND SPECTRAL GAP OF THE SOFTMAX ATTENTION MARIX We start our analysis by providing additional definitions which extend those in Section 3 and proving a couple of lemmas. Denote by a RN a single row of the attention scores matrix A. By denoting a corresponding row of the normalized attention scores matrix A A A by x, such that τa = x, we can write p RN a single row of the softmax attention matrix from eq. (6) as: p = softmax(x, τ) = e x τ PN j=1 e Where τ is the temperature of SA from eq. (5). Similarly to eq. (7), we define the entropy of a single row p as: i=1 pi log2(pi) (20) Published as a conference paper at ICLR 2024 Additionally, we define a variance of the single row p of the SA matrix as: Denote µ = PN i=1 pixi Lemma A.1. The following holds PN i=1 pi(xi µ) = 0 i=1 pi(xi µ) = Lemma A.2. Let p as in eq. (19), then the following holds. i=1 p2 i (xi µ) 0 (22) Proof. By denoting δi = xi µ and substituting it into Equation (22) together with eq. (19), we obtain i=1 p2 i (xi µ) = 1 PN j=1 e τ δi = 1 PN j=1 e Note that we can replace xi by δi due to the translation invariance of the softmax. Thus, it remains to show that: By expressing the equation from Lemma A.1 in terms of δi we get i=1 pi(xi µ) = 1 PN j=1 e τ δi = 1 PN j=1 e τ δi = 0 (23) Split the sum in eq. (23) into two parts, first sum over the elements δi 0 and second over δi < 0, as following: τ |{z} <1 e i=1 p2 i (xi µ) Published as a conference paper at ICLR 2024 Theorem A.3. The variance σ2 p is monotonically decreasing with temperature τ. Proof. By taking the derivative of the σ2 x with respect to the τ we get. σ2 p τ = 2 1 N pixi p2 i µ + 1 i=1 p2 i xi µ i=1 pi(xi µ) i=1 p2 i (xi µ) Note that PN i=1 p2 i (xi µ) 0 as follows from the Lemma A.2. Lemma A.4. The following holds i=1 pix2 i µ2 = i=1 pi(xi µ)2 i=1 pix2 i µ2 = i=1 pi(xi µ + µ)2 µ2 = i=1 pi(xi µ)2 + 2 i=1 pi(xi µ)µ + µ2 µ2 = i=1 pi(xi µ)2 + 2µ i=1 pi(xi µ) i=1 pi(xi µ)2 A.4.1 PROOF OF THEOREM 3.2 Proof. To show that the entropy in eq. (7) is monotonically increasing with temperature, we first show that the entropy of the single row eq. (20) of the SA matrix is monotonically increasing. To that end, we take a derivative H τ of the entropy with respect to the temperature and show that is always positive. Denote S = PN j=1 e Published as a conference paper at ICLR 2024 The derivative of the single entry of p with respect to the temperature is given by: τ 2 pi(xi µ) The derivative of the entropy from eq. (20) with respect to the temperature: log2(pi) + 1 ln 2 log2(pi) + 1 ln 2 τ 2 pi(xi µ) = i=1 pixi log2(pi) i=1 pi log2(pi)µ + 1 ln 2 i=1 pi 1 ln 2µ i=1 pixi log2(pi) µ i=1 pi log2(pi) i=1 pi log2(pi)(xi µ) = τ log2 S (xi µ) = i=1 pix2 i log2 S i=1 pixi + log2 Sµ 1 i=1 pix2 i µ2 ! i=1 pi(xi µ)2 > 0 Where, follows from Lemma A.4. Finally, since the entropy in eq. (6) is the average entropy of the rows, the Theorem 3.2 follows. H(P (SM)) = 1 i=1 H(pi) (24) A.4.2 PROOF OF THEOREM 3.3 The proof is based on the Principal Component Analysis (PCA) (Jolliffe, 2011) of the stochastic matrix P . Proof. To start our proof, we first need to center matrix P in eq. (6) by deflating the first eigenvalue. 
Let µ = 1 N P 1 = 1 N PN i=1 P ij the vector mean of P . Note that v1 = 1 is the eigenvector of P corresponding to the first eigenvalue λ1 = 1 for which the following holds: N 1 P 1 = 1 j=1 P ij = 1 (25) Published as a conference paper at ICLR 2024 Thus, according to the Wielandt deflation theorem the eigenvalues of the matrix P = P λ1v1µ are 0, λ2, . . . , λN. Furthermore, it is important to note that the matrix P is centered in both rows and columns as follows: N X P ij = 1 P = 1 P λ11 v1µ = 1 P Nµ = 1 P 1 P = 0 P ij = P1 = P1 λ1v1µ 1 = 1 v1 1 N 1 P1 = 1 1 = 0 Therefore, the empirical covariance matrix of P can be expressed as: ΣP = P λ1v1µ P λ1v1µ = P P Denote by v v v1, . . . , v v v N the eigenvectors of P , where v v v1 corresponds to the eigenvalue λ1 = 0. The variance in P along the direction of the principal component v v vi can be expressed as: σ2 v v vi = v v v iΣP v v vi v v v i v v vi = v v v i P P v v vi = P v v vi 2 = λ2 i v v vi 2 = λ2 i Therefore, the variance is maximized when v v vi = v v v2, and λ2 2 = σ2 v v v2 represents the amount of variance in the direction specified by the largest principal component of P . Thus, λ2 indicates the level of variability of matrix P along the direction specified by the major principal component of P . Moreover, according to Theorem A.3, if the rank of matrix P remains unchanged, the variability should decrease as the temperature increases. However, if the matrix P is biased towards a particular column, the variability within the columns decreases, resulting in a smaller value of λ2. Consequently, we can infer that the spectral gap, defined as γ = 1 |λ2| = 1 σ v v v2 1, increases with the temperature only when the stochastic matrix is unbiased. A.5 PROOF OF PROPOSITION 3.1 Proof. Generally, the product of two independent Gaussian variables has a density in the form of a modified Bessel function of the second kind (Weisstein, 2003). When the vector dimensions are sufficiently large, the Central Limit Theorem implies that the distribution of the dot product between q and k can be approximated by a Gaussian distribution with zero mean and variance σ2. As mentioned in section 3, the variance of q k can be expressed as σ2 = σ2 qσ2 k + Ccross, where Ccross = Cov(q2,k2) Cov(q,k)2 is the cross-covariance of the squared queries and keys (Goodman, 1960). Therefore, due to the exponent in eq. (6), the numerator is approximately a log-normal variable with zero mean and variance σ2. To address the denominator, we must consider a sum of log-normal variables. Fortunately, Fenton (1960) theorem states that for moderate values of σ2, the sum of zero-mean i.i.d. log-normal variables can be approximated by a log-normal variable with variance σ2 Σ and mean µΣ where: σ2 Σ = ln 1 eσ2 1 + 1 ; µΣ = ln N + (σ2 σ2 Σ)/2 For large N and moderate values of σ2, we have σ2 Σ σ2, and we can omit the σ2 Σ term for simplicity. Finally, since the ratio of log-normal variables remains log-normal with mean µΣ and variance σ2, Proposition 3.1 follows. To empirically validate the assumption made in Proposition 3.1, we measure the variance and mean of the SA and compare them to the values predicted by the theory. The results are presented in Figure 5a, which shows that the measured statistics closely match the theoretical predictions. 
[Figure 5: (a) The variance and mean of the SA matrix with respect to the input variance; the measured values closely match the theoretical estimations σ²_sm ≈ σ²_q σ²_k and μ_sm ≈ −ln N − ½σ²_sm. (b) The variance of the SA and of the LLN Attention (with α = β = 1 and with moment matching) before and after performing the moment-matching procedure.]

A.6 PROOF OF PROPOSITION 4.1

Proof. To demonstrate that the LLN Attention matrix approximately follows a log-normal distribution, we examine a single entry of the attention matrix, which involves a dot product of the two vectors q and k. The numerator \langle e^{\alpha q}, e^{\beta k} \rangle = \sum_{i=1}^{d} e^{\alpha q_i + \beta k_i} represents a sum of d log-normal variables with zero mean and variance \bar{\sigma}^2, where

\bar{\sigma}^2 = \alpha^2 \sigma_q^2 + \beta^2 \sigma_k^2    (26)

Similarly, the denominator \sum_{j=1}^{N} \langle e^{\alpha q}, e^{\beta k_j} \rangle of the LLN Attention matrix is also a sum of log-normal variables. According to (Fenton, 1960), for moderate values of \bar{\sigma}^2, the distribution of a sum of log-normal variables can be approximated by another log-normal distribution at the right tail. Since the ratio of log-normal variables is also log-normal, we can approximate the distribution of the LLN Attention matrix by a log-normal distribution.

To determine the relationship between the variance \sigma_{lln}^2 of the LLN Attention matrix and the variances \sigma_q^2, \sigma_k^2 of the queries and keys, we need to estimate the variance of a sum of log-normal variables. Following the approach of (Romeo et al., 2003), we divide the analysis into three cases: narrow (\bar{\sigma}^2 \ll 1), moderate (\bar{\sigma}^2 \approx 1), and broad (\bar{\sigma}^2 > 1). Denote the variance of the sum in the numerator by \sigma^2_{nom} and the variance of the denominator by \sigma^2_{den}.

NARROW CASE. If 0 < \sigma_q^2, \sigma_k^2 \ll 1, the values are small (close to zero), so we can approximate e^{\alpha q_i} \approx 1 + \alpha q_i and e^{\beta k_i} \approx 1 + \beta k_i; thus:

\sigma^2_{nom} \approx d(\alpha^2\sigma_q^2 + \beta^2\sigma_k^2) = d\,\bar{\sigma}^2; \qquad \sigma^2_{den} \approx N(\alpha^2\sigma_q^2 + \beta^2\sigma_k^2) = N\,\bar{\sigma}^2    (27)

MODERATE CASE. When \bar{\sigma}^2 is relatively small (i.e., \approx 1), we can use the (Fenton, 1960) method. The approximation is given by:

\sigma^2_{nom} \approx \ln\!\left(\frac{e^{\alpha^2\sigma_q^2 + \beta^2\sigma_k^2} - 1}{d} + 1\right)    (28)

Similarly, for the denominator:

\sigma^2_{den} \approx \ln\!\left(\frac{e^{\alpha^2\sigma_q^2 + \beta^2\sigma_k^2} - 1}{N} + 1\right)    (29)

In Fig. 6a, we demonstrate the empirical evaluation of the correctness of the Fenton approximation when \bar{\sigma}^2 \in [0.2, 1.2].

[Figure 6: The variance of the sum of d log-normally distributed inputs, each with variance σ² and zero mean, is effectively estimated by the (Fenton, 1960) method. (a) In the moderate scenario, where σ² lies within the range [0.2, 1.2], the theoretical estimations (dashed lines) align with the empirical results for d = 64, 256, 512. (b) In the broad case, for sufficiently large σ², the variance of the log-normal sum grows linearly with the variance of the input variables.]

In the broad case, when \bar{\sigma}^2 is large (i.e., \bar{\sigma}^2 \gg 1), it is not possible to find a closed-form approximation for the log-normal sum. However, we can observe that the sum of exponents is dominated by the largest term, which corresponds to the maximum value. This maximum value grows linearly with the spread of queries and keys under the Gaussian assumption.
Consequently, according to (Romeo et al., 2003), the resulting variances $\sigma^2_{\mathrm{nom}}$ and $\sigma^2_{\mathrm{den}}$ are also approximately linear in $\sigma^2$, with some constants $a_1, a_2$ and $b_1, b_2$:

$$\sigma^2_{\mathrm{nom}} \approx a_1(\alpha^2\sigma_q^2 + \beta^2\sigma_k^2) + b_1 = a_1\sigma^2 + b_1 \qquad (30)$$
$$\sigma^2_{\mathrm{den}} \approx a_2(\alpha^2\sigma_q^2 + \beta^2\sigma_k^2) + b_2 = a_2\sigma^2 + b_2 \qquad (31)$$

We empirically evaluated this linear-dependency assumption for the sum of log-normally distributed inputs and show its validity in fig. 6b. Finally, the variance of the LLN Attention matrix is the sum of the numerator and denominator variances, i.e.:

$$\sigma^2_{\mathrm{lln}} = \sigma^2_{\mathrm{nom}} + \sigma^2_{\mathrm{den}} \qquad (32)$$

By denoting $a = a_1 + a_2$ and $b = b_1 + b_2$, for the broad case we get:

$$\sigma^2_{\mathrm{lln}} = a\sigma^2 + b = a(\alpha^2\sigma_q^2 + \beta^2\sigma_k^2) + b \qquad (33)$$

Figure 7: Histogram of the SA and LLN Attention before and after performing the moment matching, for $\sigma_q^2 = \sigma_k^2 = 1$ (left) and for $\sigma_q^2 = 2.25$, $\sigma_k^2 = 1.44$ (right).

A.7 MOMENT MATCHING

As shown in Figure 5b, we are interested in handling a broad range of $\sigma^2$ values, particularly those in $[1, 4]$. Using Proposition 4.1, we can find the constants $a$ and $b$ that satisfy the requirement $\sigma^2_{\mathrm{lln}} = \sigma^2_{\mathrm{sm}}$ of the broad case through moment matching. To do this, we perform linear interpolation between the variances of the LLN and Softmax Attention by injecting uncorrelated Gaussian inputs into both attentions and measuring their output variances according to the following equation:

$$\sigma^2_{\mathrm{sm}} = \sigma^2_{\mathrm{lln}} = a\sigma^2 + b \qquad (34)$$

Once we have determined the values of $a$ and $b$, we can substitute them into Equation (10) to obtain the optimal values of $\alpha$ and $\beta$. As depicted in Figure 5b, the variance of the LLN Attention without moment matching (i.e., $\alpha = \beta = 1$) is much smaller than that of the Softmax Attention and exhibits a nearly linear trend. In contrast, the variance of the LLN Attention with moment matching closely approximates that of the Softmax Attention. Additionally, the histogram shown in Figure 7 suggests that the LLN Attention distribution closely follows that of the Softmax Attention. The two distributions nonetheless have slightly different means, because we only match their variances.

A.8 EXPERIMENTS

In this section, we present further experimental results and ablation studies of our method.

A.8.1 PRE-TRAINING OF ROBERTA MODEL

We train the bidirectional RoBERTa-base encoder model (Liu et al., 2019) with LLN Attention on the WikiText-103 corpus (Merity et al., 2018). During pre-training, we monitor the convergence of the model and compare its performance to the SA model. In Figure 8a, we show the training and validation loss of the RoBERTa-base model during pre-training with LLN Attention, together with a comparison to SA. The loss curve of LLN Attention closely follows that of SA, indicating similar convergence behavior. We used the Fairseq framework (Ott et al., 2019) for all experiments, with the default configuration and hyperparameters of the RoBERTa-base model.4 We perform the training in FP16 precision, which can cause instability during training. To test the stability of the training, we also log the inverse loss-scale parameter in Figure 8b. Spikes in the plot indicate a decrease in the loss scale due to large gradients. As can be seen from the figure, the maximum inverse scale of LLN Attention does not exceed that of SA, which indicates training stability similar to that of SA.
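The sketch below (our reconstruction, not the released implementation) mimics this procedure numerically: it measures the variance of the log attention weights for both attentions on uncorrelated Gaussian inputs, fits the linear law of eq. (34) with α = β = 1 to obtain a and b, and then solves for a common α = β. Since Equation (10) is not reproduced in this appendix, the closed-form solve at the end is an assumption (it takes α = β and uses the measured softmax variance as the matching target); treat it as illustrative only.

```python
# Hedged sketch of the moment-matching procedure of Appendix A.7 (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, N, trials = 64, 1024, 50

def log_var_softmax(sq, sk):
    """Variance of log softmax-attention weights for Gaussian q/k with std sq, sk."""
    out = []
    for _ in range(trials):
        q = rng.normal(0, sq, d); K = rng.normal(0, sk, (N, d))
        z = K @ q / np.sqrt(d)
        p = np.exp(z - z.max()); p /= p.sum()
        out.append(np.log(p))
    return np.concatenate(out).var()

def log_var_lln(sq, sk, alpha, beta):
    """Variance of log LLN-attention weights built from the exponential feature maps."""
    out = []
    for _ in range(trials):
        q = rng.normal(0, sq, d); K = rng.normal(0, sk, (N, d))
        num = np.exp(alpha * q) @ np.exp(beta * K).T   # <e^{alpha q}, e^{beta k_j}>, j = 1..N
        out.append(np.log(num / num.sum()))
    return np.concatenate(out).var()

# 1) Fit a, b of eq. (34) with alpha = beta = 1 over a sweep of input variances.
sweep = [(1.0, 1.0), (1.2, 1.2), (1.5, 1.2), (1.5, 1.5)]
x = np.array([sq**2 + sk**2 for sq, sk in sweep])      # sigma^2 at alpha = beta = 1
y = np.array([log_var_lln(sq, sk, 1.0, 1.0) for sq, sk in sweep])
a, b = np.polyfit(x, y, 1)

# 2) Match the LLN variance to the measured softmax variance at a working point
#    (assumes alpha = beta; the paper's eq. (10) may differ).
sq, sk = 1.5, 1.2
target = log_var_softmax(sq, sk)
alpha = beta = np.sqrt(max(target - b, 0.0) / (a * (sq**2 + sk**2)))
print(f"a = {a:.3f}, b = {b:.3f}, alpha = beta = {alpha:.3f}")
```

Fitting a and b with α = β = 1 mirrors the broad-case linearity of eq. (33); in practice the fitted α, β end up above 1, consistent with the moment-matched values reported later in the ablation study.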
4https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/README.md

Figure 8: (a) Training and validation loss comparison of the RoBERTa-base model pre-trained with LLN Attention and SA. (b) Inverse of the loss scale during training of the RoBERTa-base model.

A.8.2 IMAGE CLASSIFICATION

To test LLN Attention on a vision task, we evaluate it on the Vision Transformer model using the vit-pytorch5 code base. Our ViT model consists of twelve layers with an embedding size of 128. We train this model for 100 epochs on the Dogs vs Cats dataset6 with LLN and Softmax Attention. The results in Table 3 show that LLN Attention performs on par with SA while outperforming the Linformer (Wang et al., 2020) method.

Softmax    LLN+Diag    Linformer
81.37      81.72       79.89

Table 3: Accuracy [%] of the ViT model trained on the Dogs vs Cats dataset with Softmax, LLN (ours), and Linformer (Wang et al., 2020) Attention.

A.8.3 LONG RANGE ARENA

We use the Long Range Arena (LRA) (Tay et al., 2020c) benchmark to evaluate LLN Attention on longer sequences. The LRA benchmark requires sequence lengths between 1k and 4k, depending on the task. To that end, we used the code base provided by Skyformer (Chen et al., 2021)7. We compare the LRA score, as well as the memory and computation complexity, of LLN Attention with the Reformer (Kitaev et al., 2020), Performer (Choromanski et al., 2020), and Skyformer (Chen et al., 2021) linear methods and with regular SA. According to Table 4, LLN Attention requires much less memory and time than the other methods while achieving an average LRA score similar to SA (Table 5).

             Time [s]                              Memory [MB]
Method       TC     LO    RE     PF    IC          TC     LO    RE    PF    IC
Softmax      21468  5905  21866  6754  13228       17108  4458  8934  4817  9632
Reformer     4610   2439  4714   4694  8737        3261   1631  3008  3258  6514
Performer    3456   1966  3761   3553  13169       2176   1122  2178  2180  4353
Skyformer    4523   2970  5602   5240  9347        3068   1697  2974  4041  8079
LLN + Diag   3043   1774  3135   3042  4053        1641   821   1586  1639  3276

Table 4: Comparison of memory [MB] and running time [s] of LLN Attention with the Reformer (Kitaev et al., 2020), Performer (Choromanski et al., 2020), and Skyformer (Chen et al., 2021) linear attention methods and the SA baseline (TC: Text, LO: ListOps, RE: Retrieval, PF: Pathfinder, IC: Image).

Method       Text (4k)  ListOps (2k)  Retrieval (4k)  Pathfinder (1k)  Image (1k)  AVG
Softmax      60.41      38.05         79.95           71.3             37.2        57.38
Reformer     61.27      37.05         78.74           67.23            44.04       57.67
Performer    57.85      37.8          80.5            62.58            37.56       55.26
Skyformer    60.88      39.36         81.54           70.23            32.64       56.93
LLN + Diag   60.72      38.91         81.21           69.81            38.65       57.86

Table 5: LRA score of LLN Attention compared with the Reformer (Kitaev et al., 2020), Performer (Choromanski et al., 2020), and Skyformer (Chen et al., 2021) linear attention methods and the SA baseline.

5https://github.com/lucidrains/vit-pytorch
6https://www.kaggle.com/competitions/dogs-vs-cats-redux-kernels-edition/data
7https://github.com/pkuzengqi/Skyformer

A.8.4 LLN ATTENTION CONCENTRATION - ABLATION STUDY

In this section, we analyze the impact of the LLN Attention temperature on the Vision Transformer (ViT) model trained on the Dogs vs Cats dataset. First, we record the values of α and β produced by the moment matching procedure during training. According to Figure 9, the values of α and β obtained during moment matching lie within the range (2, 2.2).
Furthermore, as defined in Equation (11), the temperature of the LLN Attention decreases as the hyper-parameters α and β increase. To assess the influence of the temperature, we train the model with various fixed values of α and β and record the resulting accuracy. In Figure 10a, we see that when the α and β values are smaller than the moment matching range, i.e., less than 2, the LLN Attention concentration is insufficient due to the high temperature, leading to accuracy degradation. Conversely, when the α and β values lie within or above the range of the moment matching values (α, β ≥ 2), the concentration is sufficient for the model to achieve the desired accuracy.

Figure 9: Change in the α (a) and β (b) parameters during training of the ViT model on the Dogs vs Cats dataset, shown per layer (layer0 to layer11).

We also highlight the risks associated with surpassing the moment matching range by increasing α and β. In particular, larger values of these parameters may compromise the stability of the training process due to increased gradients, a concern that becomes especially noticeable when training models in Float16 format. The risk of using the Float16 data type stems from its reduced precision, smaller dynamic range, susceptibility to gradient overflow, and the need to maintain loss scaling. Moreover, the lower precision of Float16 may result in information loss during computations and numerical instability. Accordingly, exceeding the moment matching values of α and β is practically undesirable, particularly when training with Float16. In Figure 10b, we illustrate this phenomenon by presenting the loss scale during training of the deit-tiny model (Touvron et al., 2020) in Float16 format. We see that for large values of α = β = 4, the inverse of the loss scale is significantly larger than for α = β = 2, indicating increased gradients and the potential for training instability or even failure. Therefore, to achieve the desired accuracy and allow stable training, it is crucial to keep the temperature in the sweet spot specified by the moment matching values.

Figure 10: (a) Accuracy of the ViT model trained on the Dogs vs Cats dataset with LLN Attention and different values of α and β. (b) Inverse of the loss scale during training of the deit-tiny model for different fixed values of α and β, as well as for moment matching.